Fast alternative to grep -f

file.contain.query.txt

ENST001 

ENST002 

ENST003

file.to.search.in.txt

ENST001 90 

ENST002 80 

ENST004 50

Because ENST003 has no entry in the second file and ENST004 has no entry in the first file, the expected output is:

ENST001 90 

ENST002 80

To look up multiple queries in a particular file, we usually run grep like this:

grep -f file.contain.query <file.to.search.in >output.file

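Since I have about 10,000 queries and almost 100,000 raw entries in file.to.search.in, it takes a very long time to finish (around 5 hours). Is there a fast alternative to grep -f?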

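Source

2012-07-15 user1421408

What exactly do you need? Do you want the lines of the second file filtered by the keys in the first? – 2012-07-15 06:54:16

I have edited in the expected result. – user1421408 2012-07-15 06:56:40

The input redirection is unnecessary. – 2012-07-15 11:02:49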

If you want a pure Perl option, read the keys from your query file into a hash table, then check each line of standard input against those keys:

#!/usr/bin/env perl
use strict;
use warnings;

# build hash table of keys
my %keyring;
open my $keys, '<', 'file.contain.query.txt'
    or die "cannot open file.contain.query.txt: $!";
while (my $key = <$keys>) {
    chomp $key;
    $keyring{$key} = 1;
}
close $keys;

# look up the first field of each line of standard input
while (my $line = <STDIN>) {
    chomp $line;
    my ($key) = split /\t/, $line; # assuming search file is tab-delimited; replace delimiter as needed
    print "$line\n" if defined $key && exists $keyring{$key};
}

You would use it like this:

lookup.pl < file.to.search.in.txt

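A hash table can use a fair amount of memory, but lookups are much faster (a hash table lookup takes constant time). That is handy here, because you have ten times more lines to look up than keys to store.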
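For readers who prefer Python, the same idea can be sketched with a set, which also gives constant-time membership tests on average. This is only an illustration and not the code from this answer: the function name `filter_by_keys` is made up, and it assumes the key is the first whitespace-separated field of each line.

```python
def filter_by_keys(query_lines, data_lines):
    """Return the data lines whose first field appears among the query keys."""
    # Build the lookup set once; membership tests are O(1) on average.
    keys = {line.strip() for line in query_lines if line.strip()}
    matches = []
    for line in data_lines:
        fields = line.split()  # assumption: whitespace-delimited fields
        if fields and fields[0] in keys:
            matches.append(line.rstrip("\n"))
    return matches


if __name__ == "__main__":
    queries = ["ENST001\n", "ENST002\n", "ENST003\n"]
    data = ["ENST001 90\n", "ENST002 80\n", "ENST004 50\n"]
    for hit in filter_by_keys(queries, data):
        print(hit)
```

With real files, the two arguments could simply be open file handles, since both are iterated line by line.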

Source

2012-07-15 07:12:15

This is a Ferrari compared to grep -f, thanks. – user1421408 2012-07-15 07:22:10

Perfect solution; +1 – 2012-07-15 10:18:56

This Perl code may help you:

use strict;
use warnings;

open my $file1, "<", "file.contain.query.txt" or die $!;
open my $file2, "<", "file.to.search.in.txt" or die $!;

# Hash %KEYS marks the keys listed in "file.contain.query.txt"
my %KEYS = ();

while (my $line = <$file1>) {
    chomp $line;
    $KEYS{$line} = 1;
}

# print each "key number" line whose key was listed in the query file
while (my $line = <$file2>) {
    if ($line =~ /(\w+)\s+(\d+)/) {
        print "$1 $2\n" if $KEYS{$1};
    }
}

close $file1;
close $file2;

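Source

2012-07-15 07:07:13

You forgot to check the return values of your system calls. – tchrist 2012-07-15 16:08:05

MySQL:

Importing the data into MySQL or similar software will give an immense improvement. Is this feasible? You could see results in a few seconds.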

mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt 

# but first you need to create the tables like this (only once off) 

create table contains (
    keyword varchar(255) 
    , primary key (keyword) 
); 

create table search (
    keyword varchar(255) 
    ,num bigint 
    ,key (keyword) 
); 

# and load the data in: 

load data infile 'file.contain.query.txt' 
    into table contains fields terminated by "add column separator here"; 
load data infile 'file.to.search.in.txt' 
    into table search fields terminated by "add column separator here";
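To try the join without setting up a MySQL server, the same schema can be sketched with Python's built-in `sqlite3` module. Note that this swaps in SQLite for MySQL, so it is an illustration of the join only, not the answer's setup. The function name `join_in_sqlite` and the in-memory database are assumptions made for the example.

```python
import sqlite3


def join_in_sqlite(query_keys, data_rows):
    """Load both inputs into an in-memory SQLite database and join them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contains (keyword TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE search (keyword TEXT, num INTEGER)")
    conn.execute("CREATE INDEX search_keyword ON search (keyword)")
    # OR IGNORE skips duplicate query keys instead of raising an error.
    conn.executemany(
        "INSERT OR IGNORE INTO contains VALUES (?)", [(k,) for k in query_keys]
    )
    conn.executemany("INSERT INTO search VALUES (?, ?)", data_rows)
    rows = conn.execute(
        "SELECT search.keyword, search.num FROM search "
        "JOIN contains USING (keyword) ORDER BY search.keyword"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    keys = ["ENST001", "ENST002", "ENST003"]
    data = [("ENST001", 90), ("ENST002", 80), ("ENST004", 50)]
    for keyword, num in join_in_sqlite(keys, data):
        print(keyword, num)
```

The index on `search.keyword` plays the same role as `key (keyword)` in the MySQL table definition above.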

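Source

2012-07-15 07:18:12

I haven't tested this, but it should work with minor adjustments for your situation. Unless you want it to be memory-based, it needs very little memory. – 2012-07-15 07:19:41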

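use strict;
use warnings;

# sort both files so they can be merged in a single pass
system("sort file.contain.query.txt > qsorted.txt") == 0 or die "sort failed: $?";
system("sort file.to.search.in.txt > dsorted.txt") == 0 or die "sort failed: $?";

open(my $qfile, '<', 'qsorted.txt') or die "cannot open qsorted.txt: $!";
open(my $dfile, '<', 'dsorted.txt') or die "cannot open dsorted.txt: $!";

# numeric comparison assumes the IDs are zero-padded to the same width,
# so that the lexical order produced by sort matches the numeric order
my $dline = <$dfile>;
while (defined(my $qline = <$qfile>)) {
    my ($queryid) = ($qline =~ /ENST(\d+)/) or next;
    while (defined $dline) {
        my ($dataid) = ($dline =~ /ENST(\d+)/);
        last if defined $dataid && $dataid > $queryid; # keep this line for the next query
        print $dline if defined $dataid && $dataid == $queryid;
        $dline = <$dfile>;
    }
}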

Source

2012-07-15 07:26:56 perreal

If you have fixed strings, use grep -F -f. This is significantly faster than a regular expression search.
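One thing to keep in mind is that a fixed-string search still matches anywhere on the line, so a pattern such as ENST001 would also hit a line that starts with ENST0010. The Python sketch below only mimics that matching behaviour. It is a naive loop, not how grep works internally, and the function name `fixed_string_filter` is made up for the example.

```python
def fixed_string_filter(patterns, lines):
    """Keep lines that contain any pattern as a literal substring."""
    # Literal substring test: no regex metacharacters are interpreted.
    return [line for line in lines if any(p in line for p in patterns)]


if __name__ == "__main__":
    lines = ["ENST001 90", "ENST0010 5", "ENST004 50"]
    for hit in fixed_string_filter(["ENST001"], lines):
        print(hit)
```

If only whole keys should match, comparing the first field against a set of keys avoids the substring hits.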

Source

2012-07-15 08:17:50 tripleee

If the files are already sorted:

join file1 file2

If they are not:

join <(sort file1) <(sort file2)
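The reason sorted input helps is that both files can then be walked once, in step, instead of searching one for every line of the other. A minimal Python sketch of such a merge join follows. It illustrates the general technique rather than the actual implementation of join, and the name `merge_join` and the (key, value) tuple layout are assumptions made for the example.

```python
def merge_join(sorted_keys, sorted_rows):
    """Single-pass join of two inputs that are both sorted by key."""
    results = []
    i = 0
    for key, value in sorted_rows:
        # Advance through the keys until we reach or pass the current row key.
        while i < len(sorted_keys) and sorted_keys[i] < key:
            i += 1
        if i < len(sorted_keys) and sorted_keys[i] == key:
            results.append((key, value))
    return results


if __name__ == "__main__":
    keys = sorted(["ENST003", "ENST001", "ENST002"])
    rows = sorted([("ENST004", "50"), ("ENST001", "90"), ("ENST002", "80")])
    for key, value in merge_join(keys, rows):
        print(key, value)
```

Each input is read only once, so after sorting the cost grows with the combined size of the two inputs.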

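Source

2012-07-15 11:01:57

If you are using Perl 5.10 or later, you can join the 'query' terms into a single regular expression, with the terms separated by the 'pipe' character (for example: ENST001|ENST002|ENST003). Perl builds a 'trie' which, like a hash, does lookups in constant time. It should run about as fast as the solution that uses a lookup hash. This is just to show another way to do it.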

#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files;

my $query = join "|", map {chomp; $_} <QUERY>;

while (<RAW>) {
    print if /^(?:$query)\s/;
}

__QUERY__
ENST001
ENST002
ENST003
__RAW__
ENST001 90
ENST002 80
ENST004 50
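The pattern construction itself carries over to other languages. The Python sketch below builds the same kind of anchored alternation with the standard `re` module. As far as I know, Python's `re` does not apply the trie optimization described above, so this shows only how the pattern is assembled and makes no claim about speed. The function name `build_matcher` is made up for the example.

```python
import re


def build_matcher(query_keys):
    """Compile one anchored alternation, e.g. ^(?:ENST001|ENST002)\\s."""
    # Escape each key so it is matched literally, then join with '|'.
    alternation = "|".join(re.escape(k) for k in query_keys)
    return re.compile("^(?:" + alternation + r")\s")


if __name__ == "__main__":
    matcher = build_matcher(["ENST001", "ENST002", "ENST003"])
    for line in ["ENST001 90", "ENST002 80", "ENST004 50"]:
        if matcher.match(line):
            print(line)
```

The trailing `\s` keeps a key such as ENST001 from matching a longer ID that merely starts with it.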

Source

2012-07-15 15:13:27
