Selecting data from a text file using a separate list - Perl or UNIX

I have a huge tab-delimited file like this:

contig04733 contig00012 77
contig00546 contig01344 12
contig08943 contig00001 14
contig00765 contig03125 88
etc.

I also have a separate tab-delimited file that contains only a subset of these contig pairs:

contig04733 contig00012
contig08943 contig00001
etc.

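I want to extract into a new file the lines of the first file that correspond to the pairs listed in the second. In this particular dataset I think each pair should be written the same way round in both files. But I would also like to know what to do if, say, you have:

file1 contig08943 contig00001 14

but in file2

contig00001 contig08943

I would still want this combination to match. Is it possible to write a script for this?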

My code is below.

use strict; 
use warnings; 

#open the contig pairs list 
open (PAIRS, "$ARGV[0]") or die "Error opening the input file with contig pairs"; 

#hash to store contig IDs - I think?! 
my $pairs; 

#read through the pairs list and read into memory? 
while(<PAIRS>){ 
    chomp $_; #get rid of ending whitepace 
    $pairs->{$_} = 1; 
} 
close(PAIRS); 

#open data file 
open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n"; 
while(<DATA>){ 
    chomp $_; 
    my ($contigs, $identity) = split("\t", $_); 
    if (defined $pairs->{$contigs}) { 
     print STDOUT "$_\n"; 
    } 
} 
close(DATA);

Source

2013-03-08 Amy Ellison

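Also, if possible, I would like to transform the third column of the first file: divide the number by 100 and then take its square root. For example, the first line of the new file in this example would be contig04733 contig00012 0.877 – 2013-03-08 14:18:50

Show us your code. What have you tried? – 2013-03-08 14:50:59

@GregBacon - tried this: use strict; use warnings; #open the contig pairs list open (PAIRS, "$ARGV[0]") or die "Error opening the input file with contig pairs"; #hash to store contig IDs - I think?! my $pairs; #read through the pairs list and read into memory? while(<PAIRS>){ chomp $_; #get rid of ending whitepace $pairs->{$_} = 1; } close(PAIRS); #open data file open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n"; while(<DATA>){ chomp $_; my ($contigs, $identity) = split("\t", $_); if (defined $pairs->{$contigs}) { print STDOUT "$_\n"; } } close(DATA); – 2013-03-08 15:52:32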

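Put the code below together, without the running commentary, to get a working program. We start with the typical front matter, which directs perl to give you helpful warnings if you make common mistakes.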

#! /usr/bin/env perl 

use strict; 
use warnings;

Showing the user how to invoke your program correctly when necessary is always a nice touch.

die "Usage: $0 master subset\n" unless @ARGV == 2;

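With read_subset, the program reads the second file named on the command line. Because your question indicates that you do not care about order, that is, that

contig00001 contig08943

is equivalent to

contig08943 contig00001

the code increments both $subset{$p1}{$p2} and $subset{$p2}{$p1}.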

sub read_subset { 
    my($path) = @_; 

    my %subset; 
    open my $fh, "<", $path or die "$0: open $path: $!"; 
    while (<$fh>) { 
        chomp; 
        my($p1,$p2) = split /\t/; 
        ++$subset{$p1}{$p2}; 
        ++$subset{$p2}{$p1}; 
    } 

    %subset; 
}

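In Perl programs, using a hash to mark events that the program has observed is very common. In fact, many examples in the Perl FAQ use a hash named %seen, as in "I have seen this."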
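As a small illustration of the %seen idiom mentioned above, here is a minimal sketch that removes duplicates from a list while keeping first-seen order. The helper name unique_in_order and the sample contig names are made up for this example and are not part of the program in this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the arguments with duplicates removed,
# keeping the order in which each value was first seen.
sub unique_in_order {
    my %seen;                          # marks every value observed so far
    return grep { !$seen{$_}++ } @_;   # true only the first time a value appears
}

my @contigs = qw(contig00012 contig04733 contig00012 contig00001 contig04733);
print join(" ", unique_in_order(@contigs)), "\n";
# prints: contig00012 contig04733 contig00001
```

The same "mark it in a hash, test it later" pattern is what %subset does in read_subset, only with two levels of keys.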

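Removing the second command-line argument with pop leaves only the master file in @ARGV, which lets the program read all input lines easily with while (<>) { ... }. With %subset populated, the code splits each line into fields and skips every line that is not marked. Everything that passes this filter is printed on standard output.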

my %subset = read_subset pop @ARGV; 
while (<>) { 
    my($f1,$f2) = split /\t/; 
    next unless $subset{$f1}{$f2}; 
    print; 
}

For example:

$ cat file1 
contig04733  contig00012  77 
contig00546  contig01344  12 
contig08943  contig00001  14 
contig00765  contig03125  88 

$ cat file2 
contig04733  contig00012 
contig00001  contig08943 

$ perl extract-subset file1 file2 
contig04733  contig00012  77 
contig08943  contig00001  14

To create a new file containing the selected subset, redirect standard output, as in

$ perl extract-subset file1 file2 >my-subset
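The asker's follow-up comment also asks for the third column to be divided by 100 and square-rooted. A minimal sketch of that step follows; the helper name transform_line and the three-decimal output format are assumptions made for illustration (the comment only shows 0.877 as an example value).

```perl
use strict;
use warnings;

# Hypothetical helper: replace the third tab-separated field of a line
# with sqrt(value / 100), formatted to three decimals (assumed format).
sub transform_line {
    my ($line) = @_;
    chomp $line;
    my ($c1, $c2, $identity) = split /\t/, $line;
    my $scaled = sprintf "%.3f", sqrt($identity / 100);
    return join("\t", $c1, $c2, $scaled) . "\n";
}

print transform_line("contig04733\tcontig00012\t77\n");
# prints: contig04733<TAB>contig00012<TAB>0.877
```

In the filtering loop of this answer, replacing print; with print transform_line($_); should give the transformed output, assuming every selected line has a numeric third column.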

Source

2013-03-08 16:48:05

Thank you so much - especially for breaking it down so I can learn from it! – 2013-03-08 17:28:21

Try this, based on your code, using a hash of hashes keyed on the two contig IDs (after splitting):

use strict; 
use warnings; 

#open the master data file (three columns: contig, contig, identity) 
open (PAIRS, "$ARGV[0]") or die "Error opening the master data file: $!"; 

#hash to store contig IDs - I think?! 
#my $pairs; 

#read every row of the master file into memory, keyed on its two contig IDs 
my %all_configs; 
while(<PAIRS>){ 
    chomp $_; #get rid of ending whitepace 
    my @parts = split("\t", $_); #split into ['contig04733', 'contig00012', 77] 
    #store the entire row as a hash of hashes 
    $all_configs{$parts[0]}{$parts[1]} = $_; 
    #$pairs->{$_} = 1; 
} 
close(PAIRS); 

#open the contig pairs list (two columns) 
open(DATA, "$ARGV[1]") or die "Error opening the contig pairs file: $!\n"; 
while(<DATA>){ 
    chomp $_; 
    my ($contigs, $identity) = split("\t", $_); 
    #see if we find a row, one way, or the other 
    my $found_row = $all_configs{$contigs}{$identity} 
     || $all_configs{$identity}{$contigs}; 
    #if found, the split, and handle each part  
    if ($found_row) { 
     my @parts = split("\t", $found_row); 
     #same column order as in the first file, third column transformed 
     my $modified_row = join("\t", $parts[0], $parts[1], sqrt($parts[2]/100)); 
     #or, to keep the column order found in the second file, use this instead: 
     #my $modified_row = join("\t", $contigs, $identity, sqrt($parts[2]/100)); 

     #keep only one of the two prints below 
     print STDOUT $found_row."\n"; #the original row, or 
     print STDOUT $modified_row."\n"; #the transformed row 
    } 
} 
close(DATA);
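An alternative to looking the pair up both ways round is to store each pair under a single order-independent key. The sketch below sorts the two contig names before joining them; the helper names pair_key and select_lines and the in-memory sample data are hypothetical and only stand in for the two files.

```perl
use strict;
use warnings;

# Hypothetical helper: build one key for a pair, whichever way round it is written.
sub pair_key {
    my ($c1, $c2) = @_;
    my @sorted = sort { $a cmp $b } ($c1, $c2);
    return join "\t", @sorted;
}

# Hypothetical helper: keep only the lines whose first two fields form a wanted pair.
sub select_lines {
    my ($wanted, @lines) = @_;
    return grep {
        my ($c1, $c2) = split /\t/;
        $wanted->{ pair_key($c1, $c2) };
    } @lines;
}

# Pairs to keep (stands in for the two-column file), written either way round.
my %wanted;
$wanted{ pair_key(@$_) } = 1
    for ( [ "contig04733", "contig00012" ], [ "contig00001", "contig08943" ] );

# Rows of the three-column file, held in memory here just for illustration.
my @master = (
    "contig04733\tcontig00012\t77",
    "contig00546\tcontig01344\t12",
    "contig08943\tcontig00001\t14",
);

print "$_\n" for select_lines( \%wanted, @master );
# prints the contig04733 row and the contig08943 row
```

With a single sorted key there is only one hash entry per pair, so a pair can be tested with one lookup regardless of the order in either file.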

Source

2013-03-08 16:48:09 FtLie
