2017-06-08 27 views
1

Remove strings and add a serial number to file headers with awk or sed

I have the following input:

>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321] 
LIAPTMILRIRLTEFCPMRTEGFEE 
TGIGPLDSRMPRYDDVVHHREIIT 
YPPEALSNDPFDPTSIDGSPSAFF* 
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321] 
VRKAERDSPCKRRGADRSFP 
KSARLISSKAFRDVFAESITNSDPFFVVR 
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII 
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA* 
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321] 
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR 
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP 
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL* 

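and I want to:

  1. Remove the line breaks from lines that do not start with >
  2. Remove the asterisks
  3. Change the FASTA headers

I can do 1 and 2 with: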

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' 
sed "s/\*//g" 

I can also append a serial number to the end of each header line:

awk '/^>/{$0=$0"_"(++i)}1' 

But I am failing at the last step, which is to replace/remove parts of the header and add the serial number.

Desired output:

>TM0001|hypothetical_protein 
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF 
>TM0002|protein_of_unknown_function 
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA 
>TM0003|ribonuclease_P_protein_component 
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL 

Answers

1

Based on your "desired" output, a GAWK solution:

awk 'BEGIN{ RS=">"; FS="[|\\]\\[]" }!$0{ next } 
    { gsub(/^ */,"",$3); gsub(/[*[:space:]]/,"",$5); printf(">TM%04d|%s\n%s\n",++c,$3,$5) 
}' yourfile 

Output:

>TM0001|hypothetical protein 
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF 
>TM0002|protein of unknown function 
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA 
>TM0003|ribonuclease P protein component 
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL 

Details:

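  • RS=">" - treat > as the record separator

  • FS="[|\\]\\[]" - field separator: any one of the characters |, [ or ]

  • !$0{ next } - skip empty records

  • gsub(/^ */,"",$3) - remove leading spaces in the third field

  • gsub(/[*[:space:]]/,"",$5) - remove the asterisk * and any whitespace characters within the fifth field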
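The output above keeps the spaces in the description (plus a trailing space), whereas the desired output in the question uses underscores. A minimal sketch of a line-based variant that also converts spaces to underscores follows. It avoids a custom RS, so it should work in any POSIX awk and not only gawk. The file name sample.fasta and the function name rename_fasta are hypothetical, and the sketch assumes every header has the form >id|ID:number| description [organism], with no | inside the description.

```shell
# Hypothetical sample file holding the first two records from the question.
cat > sample.fasta <<'EOF'
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321]
VRKAERDSPCKRRGADRSFP
KSARLISSKAFRDVFAESITNSDPFFVVR
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA*
EOF

# rename_fasta FILE: renumber headers as TM0001, TM0002, ... and join sequences.
rename_fasta() {
    awk '
    /^>/ {
        if (seq != "") print seq       # flush the previous joined sequence
        seq = ""
        split($0, parts, "[|]")        # parts[3] is " description [organism]"
        desc = parts[3]
        sub(/ *\[.*$/, "", desc)       # drop the bracketed organism name
        gsub(/^ +| +$/, "", desc)      # trim leading and trailing spaces
        gsub(/ +/, "_", desc)          # remaining spaces become underscores
        printf ">TM%04d|%s\n", ++n, desc
        next
    }
    {
        gsub(/[* \t\r]/, "")           # strip asterisks and stray whitespace
        seq = seq $0                   # append to the current sequence
    }
    END { if (seq != "") print seq }   # flush the last sequence
    ' "$1"
}

rename_fasta sample.fasta
```

For the sample above, the first header printed is >TM0001|hypothetical_protein, followed by the three sequence lines joined into one. Numbering follows the order of records in the file rather than the number embedded in the original identifier, which matches the desired output in the question (ThimoAM_0002 becomes TM0002 and Thimo_0002 becomes TM0003).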