Generating a diff between two asymmetric files in bash

I have a large text file, biggerFile, with 2M entries, and another, smaller text file with 1M entries.

All the entries in the smaller file (File2) are present in File1.

The entries in the larger file have this format:

helloworld_12345_987654312.zip 
helloWorld_12344_987654313.zip 
helloWOrld_12346_987654314.zip

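The smaller file contains data like

987654312 
987654313

that is, the last part of the file name before the .zip extension. Could someone give me any pointers on how to find the entries in the larger file that have no match in the smaller one?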

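My attempt was to loop over the smaller file, grep the larger file for each entry, and delete the entry from the larger file whenever it is found. At the end of the process, only the missing entries are left in the file.

This solution works, but it is inefficient and crude. Can someone suggest a better approach to this problem?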

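Source

2013-08-29 dpsdce

Is it a constant number of digits? – FakeRainBrigand

So the smaller file has no file names, just the numbers contained in the file names that exist in the larger file? – lurker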

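grep has a -f switch that reads the patterns from a file. Combine it with -v, which prints only the lines that do not match, and you have an elegant solution. Since your patterns are fixed strings, you can improve performance dramatically by also using -F.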

grep -F -v -f smallfile bigfile
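
For comparison, here is a minimal Python sketch of the same idea that uses a set lookup instead of grep. It is a different technique from the command above: it compares only the number between the last underscore and the extension, so it does not depend on the numbers having a fixed number of digits. The function names extract_id and missing_entries are made up for this illustration, and the sketch assumes every entry in the larger file ends in _<number>.<extension>.

```python
def extract_id(entry: str) -> str:
    """Return the text between the last '_' and the final extension."""
    stem = entry.strip().rsplit(".", 1)[0]  # drop the extension, e.g. ".zip"
    return stem.rsplit("_", 1)[-1]          # keep what follows the last '_'


def missing_entries(big_lines, small_lines):
    """Yield entries from big_lines whose trailing ID is absent from small_lines."""
    # Load the smaller file's IDs into a set for constant-time lookups.
    known = {line.strip() for line in small_lines if line.strip()}
    for line in big_lines:
        entry = line.strip()
        if entry and extract_id(entry) not in known:
            yield entry


if __name__ == "__main__":
    # In-memory stand-ins for the two files (assumption: one entry per line).
    big = [
        "helloworld_12345_987654312.zip",
        "helloWorld_12344_987654313.zip",
        "helloWOrld_12346_987654314.zip",
    ]
    small = ["987654312", "987654313"]
    for entry in missing_entries(big, small):
        print(entry)
```

With real files you could pass the open file objects directly, for example missing_entries(open('bigfile'), open('smallfile')), since both arguments only need to be iterables of lines.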

I wrote a Python script to generate some test data:

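# Generate test data: bigfile gets `count` lines, smallfile gets every odd number.
count = 2000000
start = 1000000

# The with statement closes both files even if writing fails.
with open('bigfile', 'w') as bigfile, open('smallfile', 'w') as smallfile:
    for i in range(start, start + count):
        bigfile.write('foo' + str(i) + 'bar\n')
        if i % 2:  # odd numbers also go into smallfile
            smallfile.write(str(i) + '\n')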

Here are some tests. I used only 2000 lines (count set to 2000), because with more lines the time needed to run grep without -F became ridiculous.

$ time grep -v -f smallfile bigfile > /dev/null 

real 0m3.075s 
user 0m2.996s 
sys 0m0.028s 

$ time grep -F -v -f smallfile bigfile > /dev/null 

real 0m0.011s 
user 0m0.000s 
sys 0m0.012s

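grep also has a --mmap switch, which according to the man page may improve performance. In my tests there was no performance gain.

For these tests I used 2 million lines.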

$ time grep -F -v -f smallfile bigfile > /dev/null 

real 0m3.900s 
user 0m3.736s 
sys 0m0.104s 

$ time grep -F --mmap -v -f smallfile bigfile > /dev/null 

real 0m3.911s 
user 0m3.728s 
sys 0m0.128s

Source

2013-08-29 11:04:12 lesmana

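Use grep. You can specify the smaller file as the file to read patterns from (using -f filename) and add -v to get the lines that do not match any pattern.

Since your patterns appear to be fixed strings, you can also supply the -F option, which speeds up grep.

The following should be self-explanatory: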

$ cat big 
helloworld_12345_987654312.zip 
helloWorld_12344_987654313.zip 
helloWOrld_12346_987654314.zip 
$ cat small 
987654312 
987654313 
$ grep -F -f small big  # Find lines matching those in the smaller file 
helloworld_12345_987654312.zip 
helloWorld_12344_987654313.zip 
$ grep -F -v -f small big # Eliminate lines matching those in the smaller file 
helloWOrld_12346_987654314.zip
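
One thing the listing does not show: with -F, each pattern is matched as a substring anywhere in the line, not only against the number before .zip. The rough Python emulation below illustrates that behavior. The function name grep_fixed_invert is made up for this sketch, and unlike grep it strips whitespace from patterns and skips blank ones. It is meant to show the matching semantics, not to be fast.

```python
def grep_fixed_invert(patterns, lines):
    """Roughly emulate `grep -F -v -f`: keep lines containing none of the patterns.

    Assumption: patterns are stripped and blank patterns are ignored,
    which differs from grep's handling of such lines.
    """
    wanted = [p.strip() for p in patterns if p.strip()]
    return [ln for ln in lines if not any(p in ln for p in wanted)]


if __name__ == "__main__":
    big = [
        "helloworld_12345_987654312.zip",
        "helloWorld_12344_987654313.zip",
        "helloWOrld_12346_987654314.zip",
    ]

    # Same data as the listing: only the third entry survives.
    print(grep_fixed_invert(["987654312", "987654313"], big))

    # A pattern equal to a middle field also matches, because the
    # comparison is a plain substring test over the whole line.
    print(grep_fixed_invert(["12346"], big))
```

With the sample data this makes no difference, because the numbers in the smaller file are longer than the middle field. It only matters if a number from the smaller file could also occur elsewhere in a file name.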

Source

2013-08-29 11:04:06 devnull
