Find all lines in text files that share at least two words (Bash)

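I have several large text files generated by different people. Each file contains a list of titles, one per line. Every sentence is worded differently, but they supposedly refer to the same set of unknown items.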

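Since the format and wording differ, I am trying to generate a shorter file of possible matches for manual inspection. I am new to Bash, and I have tried several commands to compare each line against the titles that share two or more keywords. The comparison should be case-insensitive, and keywords should be longer than 4 characters so that articles and the like are excluded.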

Example:

Input text file #1

Investigating Amusing King : Expl and/in the Proletariat 
Managing Self-Confident Legacy: The Harlem Renaissance and/in the Abject 
Inventing Sarcastic Silence: The Harlem Renaissance and/in the Invader 
Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized 
Loss: Supplementing Transgressive Production and Assimilation

Input text file #2

Loss: Judging Foolhardy Historicism and Homosexuality 
Loss: Developping Homophobic Textuality and Outrage 
Loss: Supplement of transgressive production 
Loss: Questioning Diligent Verbiage and Mythos 
Me Against You: Transgressing Easygoing Materialism and Dialectic

Output text file

File #1-->Loss: Supplementing Transgressive Production and Assimilation 
File #2-->Loss: Supplement of transgressive production

So far I have been able to weed out a few duplicates with exactly identical entries...

cat FILE_num*.txt | sort | uniq -d > berbatim_duplicates.txt

...and a few others that have annotations in braces:

cat FILE_num*.txt | sort | cut -d "{" -f2 | cut -d "}" -f1 | uniq -d > same_annotations.txt

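A command that looked very promising was one that finds identical annotations with a regular expression, but I could not make it work.

Thanks in advance.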
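For the brace annotations, a minimal Python sketch could look like the following. It assumes an annotation is any text between `{` and `}` on a single line and that annotations should be compared case-insensitively. The helper name `duplicate_annotations` and the sample titles are invented for illustration; they are not from the question.

```python
import re
from collections import defaultdict

# Assumption: an annotation is any text between "{" and "}" on one line.
ANNOTATION = re.compile(r'\{([^{}]*)\}')


def duplicate_annotations(lines):
    """Map each annotation that occurs on 2+ lines to those lines."""
    seen = defaultdict(list)
    for line in lines:
        # Use a set so an annotation repeated within one line counts once.
        for note in {m.strip().lower() for m in ANNOTATION.findall(line)}:
            seen[note].append(line)
    return {note: found for note, found in seen.items() if len(found) > 1}


if __name__ == '__main__':
    # Invented sample data, only to show the shape of the result.
    sample = [
        'First title {item 12}',
        'A differently worded title {Item 12}',
        'Unrelated title {item 7}',
    ]
    for note, found in duplicate_annotations(sample).items():
        print(note, found, sep=' -> ')
```

Unlike the `cut` pipeline, this keeps the full original lines next to each shared annotation, which may make manual inspection easier.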

2015-11-08 bcnguy

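I don't think this problem is a good fit for `bash`, and certainly not for a one-liner! Consider using a scripting language like Python, so that you can keep track of the lines in each file more easily.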

OK, would you be so kind as to provide me with an example or some pointers to get started? Thanks – bcnguy

There must be two keywords in common, but in your example "Supplementing" == "Supplement" – Labo

In Python 3:

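from sys import argv
from re import sub


def getWordSet(line):
    """Return the lowercased words longer than 4 characters in a line."""
    # Drop [..] and (..) annotations and basic punctuation. The non-greedy
    # ".*?" keeps the text between two separate annotations on one line.
    line = sub(r'\[.*?\]|\(.*?\)|[.,!?:]', '', line)
    return {word.lower() for word in line.split() if len(word) > 4}


def compare(file1, file2):
    """Print every pair of lines that share at least two keywords."""
    # Build the word sets of the second file once, not once per line of file 1.
    lines2 = [(line, getWordSet(line)) for line in file2.splitlines()]
    for line1 in file1.splitlines():
        set1 = getWordSet(line1)
        for line2, set2 in lines2:
            if len(set1 & set2) > 1:
                print("File #1-->", line1, sep='')
                print("File #2-->", line2, sep='')


if __name__ == '__main__':
    with open(argv[1]) as file1, open(argv[2]) as file2:
        compare(file1.read(), file2.read())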

This gives the expected output: it prints the matching pairs of lines from the two files.
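As a quick check of why the pair from the question matches, here is a self-contained sketch that applies roughly the same filtering as the script above to the two example titles. The name `keywords` is a hypothetical stand-in used only in this snippet.

```python
import re


def keywords(line):
    """Lowercased words longer than 4 characters, punctuation removed."""
    line = re.sub(r'\[.*?\]|\(.*?\)|[.,!?:]', '', line)
    return {word.lower() for word in line.split() if len(word) > 4}


title1 = 'Loss: Supplementing Transgressive Production and Assimilation'
title2 = 'Loss: Supplement of transgressive production'

shared = keywords(title1) & keywords(title2)
print(sorted(shared))  # ['production', 'transgressive']
```

"Loss" has only 4 characters, so it is filtered out, and "Supplementing" and "Supplement" are treated as different words. The pair still qualifies because "transgressive" and "production" are shared.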

Save this script in a file. I will call it script.py, but you can name it whatever you like. You can run it with

python3 script.py file1 file2

You can even use an alias:

alias comp="python3 script.py"

and then

comp file1 file2

I have included the features from the discussion below.

2015-11-08 17:09:29 Labo

Thank you Labo, but it gives me an error: File "find duplicates.py", line 16 print("File #1-->",line1,sep='') ^ SyntaxError: invalid syntax – bcnguy

OK, I have no experience with Python, so after looking into it a bit I changed the print statement to print("File #1-->%s" % line1), and it works fine. Thanks! – bcnguy

Now I just need to figure out how to pass it the files... – bcnguy
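On passing more than two files: one possible approach, sketched below under assumptions, is to accept any number of paths on the command line and compare every pair of files. The names `keywords`, `load` and `compare_all` are hypothetical, and the filtering mirrors the answer only approximately.

```python
import re
import sys
from itertools import combinations


def keywords(line):
    """Lowercased words longer than 4 characters, punctuation removed."""
    line = re.sub(r'\[.*?\]|\(.*?\)|[.,!?:]', '', line)
    return {word.lower() for word in line.split() if len(word) > 4}


def load(path):
    """Read a file into a list of (line, keyword set) pairs."""
    with open(path) as handle:
        return [(line, keywords(line)) for line in handle.read().splitlines()]


def compare_all(paths):
    """Yield (path1, line1, path2, line2) for lines sharing 2+ keywords."""
    files = {path: load(path) for path in paths}
    for path1, path2 in combinations(paths, 2):
        for line1, set1 in files[path1]:
            for line2, set2 in files[path2]:
                if len(set1 & set2) > 1:
                    yield path1, line1, path2, line2


if __name__ == '__main__':
    # Every file named on the command line is compared with every other one.
    for path1, line1, path2, line2 in compare_all(sys.argv[1:]):
        print(path1, '-->', line1, sep='')
        print(path2, '-->', line2, sep='')
```

Saved under a name such as compare_all.py (hypothetical), it could be run as `python3 compare_all.py FILE_num*.txt`; the shell expands the pattern into the list of files before Python starts.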
