2014-02-06 67 views
0

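Using Perl to match a set of variables in a txt file

I want to match a set of variables from an input file against my data file and return various fields.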

input.txt:

ENSG00000165322 
ENSG00000170540 
ENSG00000143153 
ENSG00000213145 

data.txt contains multiple fields separated (I think) by semicolons (;). Here is an example:

chr10 gencodeV7 gene 32094365 32217742 0.714042 - . gene_id "ENSG00000165322.12"; transcript_ids "ENST00000311380.4,ENST00000375250.5,ENST00000492028.1,ENST00000497085.1,ENST00000493008.1,ENST00000344936.2,ENST00000396144.4,ENST00000375245.4,ENST00000477117.1,ENST00000497103.1,ENST00000454919.1,"; RPKM1 "7.54177"; RPKM2 "9.47656"; iIDR "0.000"; 
chr16 gencodeV7 gene 18802991 18812917 7.333434 - . gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,ENST00000545430.1,ENST00000546206.1,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000"; 

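I want to match each variable in input.txt against the data file and print the matching term together with the numeric value associated with RPKM1 (the value in double quotes) and the corresponding numeric value for RPKM2. Where there is no match, it should print N/A, so that the output looks like this: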

ENSG00000165322 7.54177 9.47656 
ENSG00000170540 84.0696 90.714 
ENSG00000143153 73.2162 85.090 
ENSG00000213145 N/A N/A 

I can do this with the following shell script:

exec < input.txt
while read line
do
    set $line
    rpkm=`grep $1 data.txt | cut -f9 | cut -d";" -f 3-4 | sed -e 's/;/\t/g'`
    echo $line $rpkm >> output.txt
done

However, I am trying to learn Perl. After several hours of searching I came up with the following, but I don't know how to get the output:

use strict; 
    use warnings; 
    my $input_txt = "input.txt" ; 
    my $raw_data = "data.txt" ; 
    if ($input_txt =~ $raw_data) ; 
close $input 

Any suggestions and explanations would be great.

+0

'perldoc perlintro' – toolic

+0

Can we call this a variable? RPKM2 "9.47656" – Sekai

+0

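The variables are the entries in input.txt, e.g. ENSG00000165322. I want to look up each variable from input.txt in data.txt and print it together with its corresponding RPKM1 and RPKM2 values. Hope this helps? – user1879573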

Answers

1

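My Perl skills are a little rusty, but I put this together for you. I tested it with the data file snippet you provided in your question, and it works (except that your sample data has no line for ENSG00000143153, so the output shows N/A for it).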

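I wasn't sure whether your gene_id should include or exclude what comes after the dot. In your example it appears to be excluded, so that is what I did. (There is a commented-out regex you can use if I assumed incorrectly.)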
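
The same choice can also be made after matching rather than inside the regex. Here is a minimal sketch, assuming the version suffix is always a dot followed by digits; `strip_version` is a hypothetical helper name used only for illustration and is not part of the script below:

```perl
use strict;
use warnings;

# Hypothetical helper: drop a trailing ".<digits>" version suffix, if present.
sub strip_version {
    my ($gene_id) = @_;
    (my $bare = $gene_id) =~ s/\.\d+\z//;   # copy first, then edit the copy
    return $bare;
}

print strip_version('ENSG00000165322.12'), "\n";   # ID with a version suffix
print strip_version('ENSG00000170540'), "\n";      # ID without one is left alone
```

Normalizing both the IDs from input.txt and the IDs from data.txt this way would let the two files disagree about the suffix and still match.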

I tried to add enough comments to the Perl code so that you can follow what I am doing along the way.

Hope this helps you!

#!/usr/bin/perl 
use strict; 
use warnings; 

my $input_file = 'input.txt'; 
my $data_file = 'data.txt'; 

# Read input file into array of variables 
my @input_vars; 
open my $input_file_handle, '<', $input_file or die $!; 
while (<$input_file_handle>) { 
    chomp $_; 
    push @input_vars, $_; 
} 
close $input_file_handle; 

# Read data file into array of data lines 
my @data_lines; 
open my $data_file_handle, '<', $data_file or die $!; 
while (<$data_file_handle>) { 
    chomp $_; 
    push @data_lines, $_; 
} 
close $data_file_handle; 

# Pare down data lines because we only care about gene_id, RPKM1, and RPKM2 
# Create 2 associative arrays which store RPKM1 and RPKM2 values based on the gene_id as the key 
my %rpkm1s; 
my %rpkm2s; 
foreach (@data_lines) { 
    # If the gene id should exclude everything after the dot, as in your example. 
    my $regex = 'gene_id(?:[ ]*)"(\w+)(?:\.\d+)?"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"'; 

    # If the gene id includes the dot and what's after it. 
    # my $regex = 'gene_id(?:[ ]*)"(\w+\.\d+)"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"'; 

    while ($_ =~ m/$regex/g) {
        # $1 is gene_id, $2 is RPKM1, and $3 is RPKM2
        # Set RPKM1 value in hash based on gene_id as the key
        $rpkm1s{$1} = $2;
        # Set RPKM2 value in hash based on gene_id as the key
        $rpkm2s{$1} = $3;
    }
} 

# Verify that I have gene_ids mapped to RPKM1 and RPKM2 values 
# while ((my $gene_id, my $rpkm1) = each(%rpkm1s)) { 
# print "GENE ID: $gene_id\n"; 
# print "\tRPKM1: $rpkm1\n"; 
# print "\tRPKM2: $rpkm2s{$gene_id}\n"; 
# print "\n"; 
# } 

# Iterate through input variables, search for values in %rpkm1s and %rpkm2s 
foreach (@input_vars) { 
    print "$_ ";
    if (exists $rpkm1s{$_}) {
        print "$rpkm1s{$_} ";
    }
    else {
        print "N/A ";
    }

    if (exists $rpkm2s{$_}) {
        print "$rpkm2s{$_} ";
    }
    else {
        print "N/A ";
    }
    print "\n";
} 
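
If the two parallel hashes feel like too much bookkeeping, the same lookup can be sketched with a single hash whose values are `[RPKM1, RPKM2]` pairs. This is only an illustrative variant of the script above, not a replacement for it: `parse_line`, `build_lookup`, and `format_row` are hypothetical names, the sample line is copied from the question instead of being read from data.txt, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Return (gene_id, RPKM1, RPKM2) for one data line, or an empty list.
# The version suffix after the dot is dropped, as in the script above.
sub parse_line {
    my ($line) = @_;
    my @fields = $line =~
        /gene_id\s+"(\w+)(?:\.\d+)?";.*?RPKM1\s+"([\d.]+)";.*?RPKM2\s+"([\d.]+)"/;
    return @fields;
}

# Build one hash: gene_id => [RPKM1, RPKM2].
sub build_lookup {
    my (@lines) = @_;
    my %rpkm;
    for my $line (@lines) {
        my ($gene_id, $rpkm1, $rpkm2) = parse_line($line);
        next unless defined $gene_id;          # skip lines that do not match
        $rpkm{$gene_id} = [$rpkm1, $rpkm2];
    }
    return \%rpkm;
}

# Format one output row, falling back to N/A when the ID is unknown.
sub format_row {
    my ($lookup, $gene_id) = @_;
    my ($rpkm1, $rpkm2) = @{ $lookup->{$gene_id} // ['N/A', 'N/A'] };
    return "$gene_id $rpkm1 $rpkm2";
}

# Sample line taken from the question (in the real script, read data.txt).
my @sample_lines = (
    'chr16 gencodeV7 gene 18802991 18812917 7.333434 - . gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,ENST00000545430.1,ENST00000546206.1,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";',
);

my $lookup = build_lookup(@sample_lines);
print format_row($lookup, $_), "\n" for qw(ENSG00000170540 ENSG00000213145);
```

With that one sample line, the first ID prints its two RPKM values and the second prints N/A twice, because it never appears in the data.
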
0

Here is a regular expression that matches your variables:

([a-z]{1}[A-Z]{3} "[0-9]\.[0-9]{3}") 

I'm not familiar with Perl, but this expression will return a set of variables that you can iterate over.
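
For what it's worth, the iteration step might look like this in Perl, where a global match in list context returns every capture at once. This is a minimal sketch under my own assumptions: the pattern is a more general `name "value"` pattern chosen for illustration (not the expression given above), `attribute_pairs` is a hypothetical helper name, and the sample string is an abbreviated version of a line from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every  name "value"  pair from an attribute string.
sub attribute_pairs {
    my ($text) = @_;
    # With /g in list context, the match returns all captures as a flat list
    # (name1, value1, name2, value2, ...), which assigns directly to a hash.
    my %pairs = $text =~ /(\w+)\s+"([^"]*)"/g;
    return %pairs;
}

# Abbreviated attribute column from the question (transcript_ids omitted).
my %attr = attribute_pairs(
    'gene_id "ENSG00000170540.7"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";'
);

for my $name (sort keys %attr) {
    print "$name => $attr{$name}\n";
}
```

Once the pairs are in a hash, `$attr{RPKM1}` and `$attr{RPKM2}` can be read directly instead of being picked out by position.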