Search large tar.gz files for a keyword, copy the matches, and delete

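What is the best way to work with large log tar.gz files, some of them 20 GB: open them and search for a keyword, copy the files that match into a directory, and then delete the extracted file so it does not take up disk space? I have some code below. It was working, but then for some reason it suddenly stopped extracting files. If I remove the -O option from tar, it extracts files again.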

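mkdir -p found
tar tf "$1" | while IFS= read -r FILE
do
    # -O streams the member to stdout so that grep can read it
    if tar xf "$1" "$FILE" -O | grep -l -- "$2"; then
        echo "found pattern in : $FILE"
        # copy the match, then remove the extracted file to free disk space
        cp "$FILE" "found/$(basename "$FILE")"
        rm -f "$FILE"
    fi
done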

$1 is the tar.gz file, $2 is the keyword

UPDATE

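I'm now doing the following, which works, but even a small archive of mine holds more than 2 million compressed files, so it will take hours to look at all of them. Is there a Python solution or something similar that can do this faster?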

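#!/bin/sh
# tarmatch.sh: run by tar --to-command once per archive member.
# The member's content arrives on stdin, its name in $TAR_FILENAME.
# $1 is the keyword, $2 is the archive to extract matches from.
if grep -l -- "$1"; then
    echo "Found keyword in ${TAR_FILENAME}"
    tar -zxvf "$2" "${TAR_FILENAME}"
else
    echo "Not found in ${TAR_FILENAME}"
fi
true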

tar -zxf 20130619.tar.gz --to-command "./tarmatch.sh '@gmail' 20130619.tar.gz "

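UPDATE 2

I'm now using Python and the speed seems to have increased: it processes about 4000 records per second, while the bash version managed about 5. I'm not strong in Python, so this code can probably be optimized. Please let me know if it can.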

import os
import sys
import tarfile
import time

if len(sys.argv) < 3:
    print("Please provide the tar.gz file and keyword to search on")
    print("USAGE: tarfind.py example.tar.gz keyword")
    sys.exit(1)

keyword = sys.argv[2].encode()  # member contents are read as bytes
t = tarfile.open(sys.argv[1], 'r:gz')
cnt = 0
foundCnt = 0
now = time.time()
directory = 'found/'
if not os.path.exists(directory):
    os.makedirs(directory)

for tar_info in t:
    cnt += 1
    if cnt % 1000 == 0:
        print("Processed " + str(cnt) + " files")
    if not tar_info.isfile():
        continue  # skip directories, links and other non-regular members
    data = t.extractfile(tar_info).read()  # read once, reuse for the copy
    if keyword in data:
        foundCnt += 1
        # binary mode so the bytes are written unchanged
        with open(directory + os.path.basename(tar_info.name), 'wb') as newFile:
            newFile.write(data)
        print("found in file " + tar_info.name)

future = time.time()
timeTaken = future - now

print("Found " + str(foundCnt) + " records")
print("Time taken " + str(int(timeTaken / 60)) + " mins " + str(int(timeTaken % 60)) + " seconds")
# guard against a zero elapsed time on very small archives
print(str(int(cnt / timeTaken) if timeTaken else cnt) + " records per second")
t.close()
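
One possible direction for the optimization asked about above, as a minimal sketch rather than a measured improvement: open the archive in tarfile's forward-only stream mode ("r|gz") and read each member only once, reusing the bytes for both the search and the copy. The function name copy_matching_members and the out_dir parameter are made up for this illustration, and the keyword is assumed to be encoded as UTF-8 in the logs.

```python
import os
import tarfile


def copy_matching_members(archive_path, keyword, out_dir="found"):
    """Scan a .tar.gz once and write members containing keyword to out_dir.

    Returns the list of matching member names.
    """
    needle = keyword.encode("utf-8")  # assumption: logs are UTF-8 compatible
    os.makedirs(out_dir, exist_ok=True)
    matches = []
    # "r|gz" treats the archive as a forward-only stream (no seeking back).
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # directories and links have no content to search
            data = archive.extractfile(member).read()  # read once, reuse below
            if needle in data:
                target = os.path.join(out_dir, os.path.basename(member.name))
                with open(target, "wb") as out:
                    out.write(data)
                matches.append(member.name)
    return matches
```

Calling copy_matching_members("20130619.tar.gz", "@gmail") would fill found/ with the matching members. Two caveats: each member is held fully in memory while it is checked, and members with the same base name overwrite each other in out_dir.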

Source

2013-07-29 tsukimi

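If you want to search the files for a keyword and extract only the ones that match, then, since the files are huge, it can take a long time if the keyword sits somewhere in the middle.

The best suggestion I can give is probably a combination of inverted-index search tools such as Solr (based on Lucene indexes) and Apache Tika (a content analysis toolkit).

With these tools you can index the tar.gz files, and when you search for a keyword, the documents that contain it are returned.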
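
To illustrate the inverted-index idea only, here is a toy sketch in plain Python. It is not Solr or Tika and says nothing about their APIs. The names build_index and lookup and the crude tokenizer are assumptions made for this example. The point is that the archive is scanned once up front, after which each keyword lookup is a dictionary access.

```python
import re
import tarfile
from collections import defaultdict

# Assumption: a crude tokenizer that keeps letters, digits and a few symbols.
TOKEN = re.compile(rb"[A-Za-z0-9_@.]+")


def build_index(archive_path):
    """Map each lower-cased token to the set of member names containing it."""
    index = defaultdict(set)
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            data = archive.extractfile(member).read()
            for token in set(TOKEN.findall(data.lower())):
                index[token].add(member.name)
    return index


def lookup(index, keyword):
    """Return the sorted member names that contain keyword as a whole token."""
    return sorted(index.get(keyword.lower().encode("utf-8"), ()))
```

Unlike a substring search, this toy index only matches whole tokens, so looking up "@gmail" would not find "user@gmail.com". A real search engine handles tokenization, persistence and scale far more carefully.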

Source

2013-07-29 10:54:24

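When I run the first line, it returns '-bash: -ztvf: command not found' – tsukimi

Ah! Check it now. To run it from a bash terminal you need to do something like $(any_command), or you can put those commands in a script. –

Replace your script with the commands above as applicable. –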

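If the file really is 20 GB, grep will take a long time in any case. The only advice I can give is to use zgrep. That saves you from having to decompress the archive explicitly.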

zgrep PATTERN your.tgz

Source

2013-07-29 10:44:26 hek2mgl

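How would I use it to copy the files it finds to another directory? – tsukimi

I must admit I did not read your question carefully. To get the files into a directory, you have to extract them. I have never done this, but I expect 'tar' can extract just a few files from the archive. 'zgrep' will give you the file names (within the archive) – hek2mgl

I ran the command like this: 'zgrep "大阪" 20130619.tar.gz', and it returned 'Binary file (standard input) matches' – tsukimi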
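
The 'Binary file (standard input) matches' message fits how zgrep works: it only removes the gzip layer and hands grep a single tar stream, so grep can tell that the keyword occurs somewhere but not in which member. The sketch below contrasts the two layers in Python. The function names gzip_layer_contains and members_containing are made up for this illustration, and the keyword is assumed to be stored as UTF-8 in the logs.

```python
import gzip
import tarfile


def gzip_layer_contains(archive_path, keyword):
    """Roughly what zgrep sees: one decompressed stream, no member names."""
    needle = keyword.encode("utf-8")  # assumption: logs are UTF-8
    keep = len(needle) - 1  # bytes carried over so split matches are found
    tail = b""
    with gzip.open(archive_path, "rb") as stream:
        while True:
            chunk = stream.read(1 << 20)
            if not chunk:
                return False
            if needle in tail + chunk:
                return True
            tail = (tail + chunk)[-keep:] if keep > 0 else b""


def members_containing(archive_path, keyword):
    """Names of regular-file members whose content contains keyword."""
    needle = keyword.encode("utf-8")
    names = []
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if member.isfile() and needle in archive.extractfile(member).read():
                names.append(member.name)
    return names
```

The first function can only answer yes or no for the whole archive, and it also matches keywords that appear in member names inside the tar headers. The second walks the tar structure and returns the member names, which is what is needed before anything can be copied to another directory.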
