Intersection of two large word lists

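I have two word lists (180k and 260k), and I want to generate a third file containing the words that appear in both lists.

What is the best way to do this? I have read forum threads that suggest using grep, but I think the word lists are too large for that approach.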

2011-01-23 pjama

If both files are sorted (or can be sorted), you can use `comm -1 -2 file1 file2` to print the intersection.

2011-01-23 05:58:24

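It turns out one of them is sorted. Could you give me a command to sort the other? – pjama 2011-01-23 06:04:47

Just `sort -o outfile infile`, assuming the other file is also sorted alphabetically. Watch out for the locale, though. In particular, whether the order is "AaBb" or "ABab" can change. To be safe, you may want to sort both files explicitly so that you know both use the same settings. – 2011-01-23 06:07:03

Thanks for your help, Jeremiah! The sort worked fine, but *comm* still warns 'comm: file 2 is not in sorted order', although it does seem to have produced *something*. Does that sound right? I'll do some QA in the morning :) – pjama 2011-01-23 06:19:04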
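
A minimal sketch of the advice above, assuming one word per line. The file names and sample words are hypothetical. Forcing `LC_ALL=C` on both `sort` and `comm` is one way to make sure every step uses the same collation; a locale mismatch is a common cause of a "not in sorted order" warning, though nothing in this thread confirms that it was the cause here.

```shell
# Hypothetical sample data; substitute the real word lists.
tmp=$(mktemp -d)
printf '%s\n' banana Apple cherry apple > "$tmp/file1"
printf '%s\n' cherry apple date Banana > "$tmp/file2"

# Sort both files under the same byte-order collation (C locale).
# -u drops duplicate words within each list.
LC_ALL=C sort -u -o "$tmp/f1" "$tmp/file1"
LC_ALL=C sort -u -o "$tmp/f2" "$tmp/file2"

# -1 and -2 suppress the lines unique to each file,
# so only the lines common to both remain.
LC_ALL=C comm -12 "$tmp/f1" "$tmp/f2" > "$tmp/common"
cat "$tmp/common"
```

In the C locale uppercase letters sort before lowercase ones, so "Apple" and "apple" count as different words. Lowercase both lists first if the comparison should ignore case.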

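You're right, grep would be a bad idea. Type "man join" and follow the instructions.

If your files are just lists of words in a single column, or at least if the important word comes first on each line, then all you need to do is: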

$ sort -b -o f1 file1 
$ sort -b -o f2 file2 
$ join f1 f2

Otherwise, you may need to give the join(1) command some additional instructions:

JOIN(1)     BSD General Commands Manual     JOIN(1) 

NAME 
    join -- relational database operator 

SYNOPSIS 
    join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2 

DESCRIPTION 
    The join utility performs an ``equality join'' on the specified files and writes the result to the standard output. The ``join field'' is the field in each file by which the files are compared. The 
    first field in each line is used by default. There is one line in the output for each pair of lines in file1 and file2 which have identical join fields. Each output line consists of the join field, 
    the remaining fields from file1 and then the remaining fields from file2. 
    . . . 
    . . .
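
As a rough illustration of those additional instructions, the sketch below assumes a hypothetical comma-separated first file whose word is in the second field, and a second file with one word per line. It uses the `-t`, `-1` and `-2` options from the synopsis above. The file names and contents are made up for the example.

```shell
# Hypothetical sample data: file1 is "id,word", file2 is one word per line.
tmp=$(mktemp -d)
printf '%s\n' '1,pear' '2,apple' '3,fig' > "$tmp/file1"
printf '%s\n' fig apple kiwi > "$tmp/file2"

# Each file must be sorted on its own join field, with the same collation.
LC_ALL=C sort -t , -k 2,2 -o "$tmp/f1" "$tmp/file1"
LC_ALL=C sort -o "$tmp/f2" "$tmp/file2"

# Join field 2 of the first file against field 1 of the second file.
LC_ALL=C join -t , -1 2 -2 1 "$tmp/f1" "$tmp/f2" > "$tmp/joined"
cat "$tmp/joined"
```

Each output line starts with the join field, followed by the remaining fields of the first file, so the matching words here come out as `apple,2` and `fig,3`.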


2011-01-23 05:58:30 DigitalRoss

Assuming one word per line, I would use grep: