Delete strings and add a sequence number to file headers with awk or sed
I have the following input:
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321]
VRKAERDSPCKRRGADRSFP
KSARLISSKAFRDVFAESITNSDPFFVVR
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA*
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321]
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL*
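and I want to:
1. remove the line breaks from the sequence lines (the lines that do not start with >)
2. remove the asterisks
3. change the FASTA headers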
I can do steps 1 and 2 with:
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }'
sed "s/\*//g"
and I can also append a sequence number to the end of each header line:
awk '/^>/{$0=$0"_"(++i)}1'
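But I fail at the last step: replacing/deleting parts of the header and adding the sequence number in the same pass.
Desired output: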
>TM0001|hypothetical_protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein_of_unknown_function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease_P_protein_component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
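
For reference, one way the three steps might be combined into a single awk pass is sketched below. It is only a sketch under some assumptions: every header has the form >name|ID:...| description [organism], the description itself contains no | character, and the TM prefix with a four-digit counter is hard-coded to match the desired output. The function name reformat_fasta is made up for this example.

```shell
# Hypothetical helper: reads FASTA on stdin, writes the reformatted FASTA to stdout.
reformat_fasta() {
    awk '
    /^>/ {
        # New header: flush the sequence collected for the previous record.
        if (seq != "") print seq
        seq = ""
        # Assumption: the description is the text after the last "|".
        n = split($0, parts, "|")
        desc = parts[n]
        sub(/ *\[.*$/, "", desc)    # drop the trailing "[organism]" part
        sub(/^ +/, "", desc)        # trim leading spaces
        sub(/ +$/, "", desc)        # trim trailing spaces
        gsub(/ +/, "_", desc)       # remaining spaces become underscores
        # "TM" and the 4-digit width are hard-coded to match the desired output.
        printf ">TM%04d|%s\n", ++i, desc
        next
    }
    {
        gsub(/\*/, "")              # step 2: remove asterisks
        seq = seq $0                # step 1: join the sequence lines
    }
    END { if (seq != "") print seq }
    '
}

# Small demo using the first record from the input above.
reformat_fasta <<'EOF'
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
EOF
```

On a real file this could be called as reformat_fasta < input.fasta > output.fasta, where both file names are placeholders. Because the counter runs over all headers in order, the original locus tags (Thimo_0001, ThimoAM_0002, ...) are discarded rather than reused.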