awk - Trim FASTA headers with sed

Question

asked Oct 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a reference genome containing the following headers (lines starting with >) that I would like to be renamed to simply the digit/letter of the chromosomes. I would like a sed statement to do this systematic replacement, but I am new to sed. Elsewhere in the file are additional headers that should be unchanged, and the genetic sequences between the headers should remain unchanged.

>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
>ST078051.1 Ovis aries is a sheep chromosome 2, whole genome shotgun sequence
>ST078052.1 Ovis aries is a sheep chromosome 3, whole genome shotgun sequence
>ST078053.1 Ovis aries is a sheep chromosome 4, whole genome shotgun sequence
>ST078054.1 Ovis aries is a sheep chromosome 5, whole genome shotgun sequence
>ST078055.1 Ovis aries is a sheep chromosome 6, whole genome shotgun sequence
>ST078056.1 Ovis aries is a sheep chromosome 7, whole genome shotgun sequence
>ST078057.1 Ovis aries is a sheep chromosome 8, whole genome shotgun sequence
>ST078058.1 Ovis aries is a sheep chromosome 9, whole genome shotgun sequence
>ST078059.1 Ovis aries is a sheep chromosome 10, whole genome shotgun sequence
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
>ST078080.1 Ovis aries is a sheep chromosome Y, whole genome shotgun sequence

Output should be:

>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>X
>Y

I tried the following, but it's not right.

sed 's/^.*\(chromosome.*,\).*$/\1/' file

Thank you!

question from:https://stackoverflow.com/questions/66067552/trim-fasta-headers-with-sed

1 Answer

深蓝 · Answer 1 · 2021-10-06T03:01:40+0000

Assuming the lines above are only some of the headers from an actual FASTA file, and the sequence data is still in the file, either of the following commands will do the job:

$ sed '/^>/{s/,.*//;s/^.* />/}' file.fasta
$ awk '/^>/{sub(/,.*$/,"");$0=">"$NF}1' file.fasta

Both commands do exactly the same thing. On each line that starts with >, they remove everything from the first comma to the end of the line, and then replace everything up to and including the last space with >. In awk, the latter is done simply by taking the last field ($NF).
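
The question also mentions other headers that should stay unchanged, and the two commands above rewrite every header line. Below is a minimal sketch of one way to limit the substitution to chromosome headers. It assumes those are exactly the headers containing the word "chromosome" followed by a name and a comma. The file names sample.fasta and renamed.fasta, the short sequences, and the "unplaced scaffold" header are made up for illustration.

```shell
# Build a small sample file (hypothetical data, for illustration only).
cat > sample.fasta <<'EOF'
>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
ACGTACGT
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
GGCCTTAA
>ST999999.1 Ovis aries unplaced scaffold 12, whole genome shotgun sequence
TTTTAAAA
EOF

# Rewrite only header lines of the form ">... chromosome NAME, ...".
# \( \) captures NAME (any run of characters that are not a space or comma);
# \1 puts it back after the ">". Lines that do not match are printed as-is.
sed 's/^>.* chromosome \([^ ,]*\),.*$/>\1/' sample.fasta > renamed.fasta

cat renamed.fasta
```

With this sample, the two chromosome headers should become >1 and >X, while the scaffold header and all sequence lines pass through untouched. Writing to a new file rather than editing in place keeps the original reference genome intact until the result has been checked.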
