2016-10-21

I've been working on this problem for a while now: I'm having trouble storing strings captured with a Perl regex.

I have a file of one hundred FASTA sequences arranged like this:

>gi|192567|gb|AAA37417.1| cystic fibrosis transmembrane conductance regulator [Mus musculus]
MQKSPLEKASFISKLFFSWTTPILRKGYRHHLELSDIYQAPSADSADHLSEKLEREWDREQASKKNPQLIHALRRCFFWRFLFYGILLYLGEVTKAVQPVLLGRIIASYDPENKVERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHRIGMQMRTAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFIWIAPLQVTLLMGLLWDLLQFSAFCGLGLLIILVIFQAILGKMMVKYRDQRAAKINERLVITSEIIDNIYSVKAYCWESAMEKMIENLREVELKMTRKAAYMRFFTSSAFFFSGFFVVFLSVLPYTVINGIVLRKIFTTISFCIVLRMSVTRQFPTAVQIWYDSFGMIRKIQDFLQKQEYKVLEYNLMTTGIIMENVTAFWEEGFGELLQKAQQSNGDRKHSSDENNVSFSHLCLVGNPVLKNINLNIEKGEMLAITGSTGLGKTSLLMLILGELEASEGIIKHSGRVSFCSQFSWIMPGTIKENIIFGVSYDEYRYKSVVKACQLQQDITKFAEQDNTVLGEGGVTLSGGQRARISLARAVYKDADLYLLDSPFGYLDVFTEEQVFESCVCKLMANKTRILVTSKMEHLRKADKILILHQGTSYFYGTFSELQSLRPSFSSKLMGYDTFDQFTEERRSSILTETLRRFSVDDSSAPWSKPKQSFRQTGEVGEKRKNSILNSFSSVRKISIVQKTPLCIDGESDDLQEKRLSLVPDSEQGEAALPRSNMIATGPTFPGRRRQSVLDLMTFTPNSGSSNLQRTRTSIRKISLVPQISLNEVDVYSRRLSQDSTLNITEEINEEDLKECFLDDVIKIPPVTTWNTYLRYFTLHKGLLLVLIWCVLVFLVEVAASLFVLWLLKNNPVNSGNNGTKISNSSYVVI
ITSTSFYYIFYIYVGVADTLLALSLFRGLPLVHTLITASKILHRKMLHSILHAPMSTISKLKAGGILNRFSKDIAILDDFLPLTIFDFIQLVFIVIGAIIVVSALQPYIFLATVPGLVVFILLRAYFLHTAQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFRRQTYFETLFHKALNLHTANWFMYLATLRWFQMRIDMIFVLFFIVVTFISILTTGEGEGTAGIILTLAMNIMSTLQWAVNSSIDTDSLMRSVSRVFKFIDIQTEESMYTQIIKELPREGSSDVLVIKNEHVKKSDIWPSGGEMVVKDLTVKYMDDGNAVLENISFSISPGQRVGLLGRTGSGKSTLLSAFLRMLNIKGDIEIDGVSWNSVTLQEWRKAFGVITQKVFIFSGTFRQNLDPNGKWKDEEIWKVADEVGLKSVIEQFPGQLNFTLVDGGYVLSHGHKQLMCLARSVLSKAKIILLDEPSAHLDPITYQVIRRVLKQAFAGCTVILCEHRIEAMLDCQRFLVIEESNVWQYDSLQALLSEKSIFQQAISSSEKMRFFQGRHSSKHKPRTQITALKEETEEEVQETRL

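I wrote a subroutine that opens the file and reads in the sequences one at a time. For each sequence, I want to take the gi number at the beginning and the long sequence of capital letters, and add them as strings to a growing array. However, I can't get the regex to store these values. Here is my current subroutine, which I tweaked to check whether I am actually capturing the gi number: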

sub getFASTA { 
    my ($filename) = @_; 
    my @FASTA_arr; 
    $/ = "\n\n"; 
    open (my $fh, '<', $filename) or 
      die ("Could not open file: $filename"); 
    while (<$fh>) { 
      chomp $_; 
      $_ =~ /^>gi|(\d*?)|/s; 
      say "$1"; 
    } 
    close $fh; 
    #say join(" ", @FASTA_arr); 
} 

However, trying to run this returns:

Use of uninitialized value $1 in string at sequenceAlignment.pl line 30, <$fh> chunk 1. 

This warning is printed once for every sequence, so 100 times in total.

So, any ideas on what's wrong? I'm almost sure it's a regex problem, because when I change the match to "$_ =~ /(>gi|)/s;" it works fine and just prints out 100 ">gi|"s.

You need to escape the pipes in the regex: '$_ =~ /^>gi\|(\d*?)\|/s'
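
To illustrate the difference, here is a minimal sketch that runs both patterns against the start of the header shown in the question. The variable names are made up for this example; only the two patterns come from the thread.

```perl
use strict;
use warnings;

# Start of the header line from the question's sample record.
my $record = ">gi|192567|gb|AAA37417.1|";

# Unescaped pipes: the pattern is an alternation of ^>gi, (\d*?), and
# the empty pattern. The first alternative matches, so the capture
# group never takes part in the match and its value is undef.
my ($unescaped) = $record =~ /^>gi|(\d*?)|/;

# Escaped pipes: each \| matches a literal "|" and the group captures
# the digits between the first two pipes.
my ($escaped) = $record =~ /^>gi\|(\d*?)\|/;

print "unescaped: ", (defined $unescaped ? $unescaped : "undef"), "\n";
print "escaped: $escaped\n";
```

With the unescaped pattern the match itself succeeds, which is why Perl only warns about $1 rather than reporting a failed match.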

Answers

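| means OR (alternation) in a regex. Escape it. Unescaped, your pattern reads as three alternatives: ^>gi, (\d*?), or the empty pattern. The first alternative matches without ever using the capture group, so $1 stays undefined. (Your /(>gi|)/ test appeared to work because there the alternation sits inside the capture group, with an empty second operand, so the group still captures.)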
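
With the pipes escaped, the goal described in the question (storing the gi number and the sequence in a growing array) could be sketched roughly as follows. This is only a sketch: parse_record and get_fasta are hypothetical names, and it assumes each record starts with a ">gi|<digits>|" header on its own line and that records are separated by blank lines, as the question's use of $/ = "\n\n" suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: split one FASTA record into [gi_number, sequence].
# Returns nothing if the record does not start with ">gi|<digits>|".
sub parse_record {
    my ($record) = @_;
    # Escaped pipes match literally; the rest of the header line is
    # skipped and everything after it is taken as the sequence.
    my ($gi, $seq) = $record =~ /^>gi\|(\d+)\|[^\n]*\n(.*)/s
        or return;
    $seq =~ s/\s+//g;    # join wrapped sequence lines
    return [$gi, $seq];
}

# Hypothetical replacement for getFASTA: returns a list of
# [gi_number, sequence] pairs, one per record in the file.
sub get_fasta {
    my ($filename) = @_;
    my @fasta;
    local $/ = "\n\n";   # one blank-line-separated record per read
    open(my $fh, '<', $filename)
        or die "Could not open file: $filename: $!";
    while (my $record = <$fh>) {
        chomp $record;
        my $parsed = parse_record($record) or next;
        push @fasta, $parsed;
    }
    close $fh;
    return @fasta;
}
```

Using local on $/ keeps the changed record separator confined to the subroutine, and returning array references keeps each gi number paired with its sequence. A caller could then loop with something like: for my $entry (get_fasta($file)) { my ($gi, $seq) = @$entry; ... }.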

Thank you, that was it! I didn't think about special characters.
