Matching a set of variables from a txt file using Perl

I want to match a set of variables from an input file against my data file and return various fields.

input.txt:

ENSG00000165322 
ENSG00000170540 
ENSG00000143153 
ENSG00000213145

The data.txt file contains multiple fields separated (I think) by semicolons (;). Here is an example:

chr10 gencodeV7 gene 32094365 32217742 0.714042 - . gene_id "ENSG00000165322.12"; transcript_ids "ENST00000311380.4,ENST00000375250.5,ENST00000492028.1,ENST00000497085.1,ENST00000493008.1,ENST00000344936.2,ENST00000396144.4,ENST00000375245.4,ENST00000477117.1,ENST00000497103.1,ENST00000454919.1,"; RPKM1 "7.54177"; RPKM2 "9.47656"; iIDR "0.000"; 
chr16 gencodeV7 gene 18802991 18812917 7.333434 - . gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,ENST00000545430.1,ENST00000546206.1,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";

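I want to match each variable in input.txt against the data file and print the matched term together with its RPKM1 value (the number in double quotes that follows RPKM1) and the corresponding RPKM2 value, printing N/A where there is no match, so that the output looks like this: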

ENSG00000165322 7.54177 9.47656 
ENSG00000170540 84.0696 90.714 
ENSG00000143153 73.2162 85.090 
ENSG00000213145 N/A N/A

I can do this with the following shell script:

exec < input.txt
while read line
do
    set $line
    rpkm=`grep $1 data.txt | cut -f9 | cut -d";" -f 3-4 | sed -e 's/;/\t/g'`
    echo $line $rpkm >> output.txt
done

But I am trying to learn Perl. After a few hours of searching I came up with the following, and I don't know how to get the output from it.

use strict;
use warnings;
my $input_txt = "input.txt";
my $raw_data = "data.txt";
if ($input_txt =~ $raw_data) ;
close $input

Any suggestions and explanations would be much appreciated.

Source

2014-02-06 user1879573

'perldoc perlintro' – toolic

Can we call this a variable? RPKM2 "9.47656" – Sekai

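The variables are the entries in input.txt, e.g. ENSG00000165322 and so on. I want to look up each variable from input.txt in data.txt and print it together with its corresponding RPKM1 and RPKM2 values. Hope that helps? – user1879573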

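My Perl skills are a bit rusty, but I put this together for you. I tested it with the data file snippet from your question and it works (except that your sample data has no line for ENSG00000143153, so the output shows N/A for it).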

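I wasn't sure whether your gene_id should include or exclude what comes after the dot. In your example it appears to be excluded, so that is what I did. (There is a commented-out regex you can use if my assumption is wrong.)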
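To make the "exclude what comes after the dot" choice concrete, here is a minimal sketch of the normalization on its own. The helper name `strip_version` is made up for illustration and does not appear in the script; it assumes the suffix is always a dot followed by digits, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script in this answer): drop a trailing
# ".<digits>" version suffix, so "ENSG00000165322.12" becomes "ENSG00000165322".
# IDs without a suffix are returned unchanged.
sub strip_version {
    my ($gene_id) = @_;
    (my $base = $gene_id) =~ s/\.\d+\z//;
    return $base;
}

for my $id ('ENSG00000165322.12', 'ENSG00000170540.7', 'ENSG00000213145') {
    printf "%s -> %s\n", $id, strip_version($id);
}
```

If the IDs in input.txt carried the version suffix as well, you would skip this step and compare the full strings instead.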

I tried to put enough comments in the Perl code for you to follow what I am doing at each step.

Hope this helps!

#!/usr/bin/perl 
use strict; 
use warnings; 

my $input_file = 'input.txt'; 
my $data_file = 'data.txt'; 

# Read input file into array of variables 
my @input_vars; 
open my $input_file_handle, '<', $input_file or die $!; 
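while (<$input_file_handle>) {
    chomp $_;
    s/^\s+|\s+$//g;      # trim stray whitespace (including a trailing \r)
    next if $_ eq '';    # skip blank lines
    push @input_vars, $_;
}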
close $input_file_handle; 

# Read data file into array of data lines 
my @data_lines; 
open my $data_file_handle, '<', $data_file or die $!; 
while (<$data_file_handle>) { 
    chomp $_; 
    push @data_lines, $_; 
} 
close $data_file_handle; 

# Pare down data lines because we only care about gene_id, RPKM1, and RPKM2 
# Create 2 associative arrays which store RPKM1 and RPKM2 values based on the gene_id as the key 
my %rpkm1s; 
my %rpkm2s; 
foreach (@data_lines) { 
    # If the gene id should exclude everything after the dot, as in your example. 
    my $regex = 'gene_id(?:[ ]*)"(\w+)(?:\.\d+)?"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"'; 

    # If the gene id includes the dot and what's after it. 
    # my $regex = 'gene_id(?:[ ]*)"(\w+\.\d+)"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"'; 

    # Each data line holds a single gene record, so one match per line is enough
    if ($_ =~ m/$regex/) {
        # $1 is gene_id, $2 is RPKM1, and $3 is RPKM2
        # Store the RPKM1 value in the hash, keyed by gene_id
        $rpkm1s{$1} = $2;
        # Store the RPKM2 value in the hash, keyed by gene_id
        $rpkm2s{$1} = $3;
    }
} 

# Verify that I have gene_ids mapped to RPKM1 and RPKM2 values 
# while ((my $gene_id, my $rpkm1) = each(%rpkm1s)) { 
# print "GENE ID: $gene_id\n"; 
# print "\tRPKM1: $rpkm1\n"; 
# print "\tRPKM2: $rpkm2s{$gene_id}\n"; 
# print "\n"; 
# } 

# Iterate through input variables, search for values in %rpkm1s and %rpkm2s 
foreach (@input_vars) {
    print "$_ ";
    if (exists $rpkm1s{$_}) {
        print "$rpkm1s{$_} ";
    }
    else {
        print "N/A ";
    }

    if (exists $rpkm2s{$_}) {
        print "$rpkm2s{$_} ";
    }
    else {
        print "N/A ";
    }
    print "\n";
}
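
The same lookup can also be done with a single hash that maps each gene_id to both values. The following is only a sketch of that variant, not a replacement for the script above: the helper name `parse_line` is invented for illustration, the two sample lines are the ones from the question with transcript_ids shortened, and the IDs are listed inline rather than read from input.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: pull (gene_id, RPKM1, RPKM2) out of one data line.
# Returns an empty list when any of the three attributes is missing.
sub parse_line {
    my ($line) = @_;
    my ($id)    = $line =~ /gene_id\s+"(\w+)(?:\.\d+)?"/ or return;
    my ($rpkm1) = $line =~ /RPKM1\s+"([^"]*)"/           or return;
    my ($rpkm2) = $line =~ /RPKM2\s+"([^"]*)"/           or return;
    return ($id, $rpkm1, $rpkm2);
}

# Sample lines in the question's format (transcript_ids shortened here).
my @data = (
    'chr10 gencodeV7 gene 32094365 32217742 0.714042 - . gene_id "ENSG00000165322.12"; transcript_ids "ENST00000311380.4,"; RPKM1 "7.54177"; RPKM2 "9.47656"; iIDR "0.000";',
    'chr16 gencodeV7 gene 18802991 18812917 7.333434 - . gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";',
);

my %rpkm;    # gene_id => [RPKM1, RPKM2]
for my $line (@data) {
    my ($id, @values) = parse_line($line) or next;
    $rpkm{$id} = \@values;
}

# Look up each wanted ID, falling back to N/A when it was not seen.
for my $id (qw(ENSG00000165322 ENSG00000170540 ENSG00000213145)) {
    my ($r1, $r2) = $rpkm{$id} ? @{ $rpkm{$id} } : ('N/A', 'N/A');
    print "$id $r1 $r2\n";
}
```

Matching each attribute with its own small regex avoids depending on RPKM1 appearing before RPKM2 on the line, at the cost of three matches per line instead of one.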

Source

2014-02-07 06:26:52

Here is a regular expression that matches your variables:

([a-z]{1}[A-Z]{3} "[0-9]\.[0-9]{3}")

I'm not familiar with Perl, but this expression will return a set of variables that you can iterate over.
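
For what "a set of variables you can iterate over" might look like in Perl, here is a rough sketch. Note that it deliberately uses a more general pattern than the one above (any word followed by a double-quoted value), which is an assumption on my part, and the attribute string is the question's second sample line with transcript_ids left out.

```perl
use strict;
use warnings;

# Attribute column from the question's second data line (transcript_ids omitted).
my $attrs = 'gene_id "ENSG00000170540.7"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";';

# In list context, a /g match with two capture groups returns a flat
# (name, value, name, value, ...) list, which can be assigned to a hash.
my %field = $attrs =~ /(\w+)\s+"([^"]*)"/g;

# Iterate over the captured name/value pairs.
for my $name (sort keys %field) {
    print "$name = $field{$name}\n";
}
```

Once the pairs are in a hash, the RPKM1 and RPKM2 values can be read as `$field{RPKM1}` and `$field{RPKM2}`.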

Source

2014-02-06 15:21:03 Sekai
