Bash script optimization

This is the script in question:

for file in `ls products`
do
    echo -n `cat products/$file \
    | grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \
    | head -1 | sed -e 's/^ *<td>//g' -e 's/<.*//g'`
done

I need to run it on 50,000+ files, and with this script that will take about 12 hours.

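The algorithm is as follows:

Find the lines that contain a table cell (<td>) and do not contain any of 'img', 'href', or 'input'.
Select the first of these, then extract the data between the tags.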

The usual bash text filters (sed, grep, awk, etc.) as well as perl are available.


2011-05-05 Marko

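If you're not going to do this more than once or twice, who cares if it takes half a day to run? If you spend 2 hours optimizing it and only gain a 1-hour speedup... is that worth it? – cdeszaq 2011-05-05 19:29:04

@cdeszaq: I have four other similar scripts, and I believe that once I see this one optimized, I can optimize those too. – Marko 2011-05-05 19:34:47

It looks like the whole thing can be replaced by a single gawk command: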

gawk ' 
    /<td>.*<\/td>/ && !(/img/ || /href/ || /input/) { 
     sub(/^ *<td>/,""); sub(/<.*/,"") 
     print 
     nextfile 
    } 
' products/*

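This uses the gawk extension nextfile, which stops reading the current file as soon as the first matching cell has been printed.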
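If your awk doesn't have nextfile, a per-file flag gives the same output. This is only a sketch: the function name first_cells_portable and the variable seen are made up for illustration, and unlike nextfile it still reads every file to the end, so it will be slower on large files.

```shell
# first_cells_portable FILE...: print the first plain <td> cell of each file,
# without relying on nextfile.
first_cells_portable() {
    awk '
        FNR == 1 { seen = 0 }   # first line of a new input file: reset the flag
        seen     { next }       # a cell was already printed for this file
        /<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
            sub(/^ *<td>/, ""); sub(/<.*/, "")
            print
            seen = 1
        }
    ' "$@"
}
```

Call it the same way as the gawk version, e.g. first_cells_portable products/*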

If the wildcard expansion is too large for the command line, then use

find products -type f -print | xargs gawk '...'
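Written out in full, that might look like the sketch below. The -print0/-0 pair is my addition (GNU and BSD find/xargs support it) so that file names containing spaces survive the pipe. The function name first_cells is made up for illustration, and it falls back to plain awk when gawk isn't installed, on the assumption that that awk understands nextfile too.

```shell
# first_cells DIR: print the first plain <td> cell of every file under DIR.
first_cells() {
    # Prefer gawk; fall back to awk (assumed here to support nextfile).
    awk_bin=$(command -v gawk || command -v awk)
    # NUL-terminated names are safe for spaces and newlines in file names;
    # xargs splits the list into batches that fit on a command line.
    find "$1" -type f -print0 | xargs -0 "$awk_bin" '
        /<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
            sub(/^ *<td>/, ""); sub(/<.*/, "")
            print
            nextfile   # done with this file after the first match
        }
    '
}
```

Note that find does not return files in sorted order, so the output order may differ from the products/* version.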


2011-05-05 19:58:16

+1 very nice – hmontoliu 2011-05-05 20:39:41

Here is some quick perl to do the whole thing, which should be faster.

#!/usr/bin/perl
use strict;
use warnings;

process_files($ARGV[0]);

# process each file in the supplied directory
sub process_files
{
    my $dirpath = shift;
    opendir(my $dh, $dirpath) or die "Can't readdir $dirpath. $!";
    # readdir in list context returns every directory entry at once
    foreach my $ent (readdir($dh)) {
        if (-f "$dirpath/$ent") {
            get_first_text_cell("$dirpath/$ent");
        }
    }
    closedir($dh);
}

# print the content of the first html table cell
# that does not mention img, href or input
sub get_first_text_cell
{
    my $filename = shift;
    open(my $fh, '<', $filename) or die "Can't open $filename. $!";
    while (my $line = <$fh>) {
        ## capture html and text inside a table cell
        if ($line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i) {
            my $cell = $1;

            ## omit anything mentioning img, href or input
            ## (href is an attribute, so match the bare word, not "<href")
            if ($cell !~ /img|href|input/i) {
                print "$cell\n";
                last;    # stop reading this file after the first hit
            }
        }
    }
    close($fh);
}

Just call it with the directory to search as the first argument:

$ perl parse.pl /html/documents/


2011-05-05 19:41:33 IanNorton

Running this on my system against a test set of 1000 files, it takes less than a second. – IanNorton 2011-05-05 19:53:57

What about this (it should be faster and clearer):

for file in products/*; do
    grep -P -o '(?<=<td>).*(?=<\/td>)' "$file" | grep -vP -m 1 '(img|input|href)'
done

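The for loop visits every file in products. Note the difference from your syntax.
The first grep outputs only the text between <td> and </td>, without the tags themselves, as long as each cell sits on a single line.
Finally, the second grep outputs just the first line (which I believe is what you wanted to achieve with head -1) that does not contain img, href or input. It exits as soon as it has found that line, which cuts the total time and lets the loop move on to the next file sooner.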
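To see the two stages at work on a single file, here is the same pipeline wrapped in a function. It is only a sketch: the name first_cell is made up for illustration, and -P needs a GNU grep built with PCRE support.

```shell
# first_cell FILE: print the first <td> cell that mentions none of
# img, input or href.
first_cell() {
    # Stage 1: -o prints only the matched text; the lookbehind and lookahead
    #          (PCRE, hence -P) keep <td> and </td> out of the match.
    # Stage 2: -v drops lines mentioning img, input or href, and -m 1 stops
    #          after the first line that survives.
    grep -P -o '(?<=<td>).*(?=</td>)' "$1" | grep -vP -m 1 '(img|input|href)'
}
```

Because .* is greedy, a line holding several cells is captured from the first <td> to the last </td>, so this only behaves as intended with one cell per line.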

I would have liked to use a single grep, but then the regular expression would be really awful. :-)

Disclaimer: of course, I haven't tested it


2011-05-05 20:06:20 hmontoliu
