Finding unique sentences in two files

I have two files, and I am trying to print the sentences that are unique to each of them. For this I am using difflib in Python.

import difflib

text = 'Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 = 'Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'

differ = difflib.Differ()
diff = differ.compare(text, text1)
print('\n'.join(diff))

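This does not give me the output I want.

The output I expect is only the sentences that are unique to each file:

text = Perhaps the oldest through its inclusion of astronomy. Over the last two millennia.

text1 = Quantum chemistry is a branch of chemistry.

Also, it seems that difflib.Differ works line by line rather than sentence by sentence. Any suggestions, please. How can I do this?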

2016-12-01 Raj

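First, Differ().compare() compares lines, not sentences.

Second, it actually compares sequences, such as lists of strings. However, you are passing two strings, not two lists of strings. Since a string is itself a sequence of characters, Differ().compare() ends up comparing individual characters in your case.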
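To make the difference concrete, here is a minimal sketch comparing the same call with strings and with lists. The inputs "cat" and "cut" are made-up examples, not taken from the question.

```python
import difflib

differ = difflib.Differ()

# Two plain strings: every character is treated as one sequence item,
# so each output line holds a two-character prefix plus one character.
char_diff = list(differ.compare("cat", "cut"))

# Two lists of strings: every list element is treated as one item,
# so each output line holds a prefix plus a whole element.
item_diff = list(differ.compare(["cat"], ["cut"]))

print("\n".join(char_diff))
print("\n".join(item_diff))
```

With the strings, the diff is reported character by character; with the lists, whole elements are reported as removed ("- ") or added ("+ ").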

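If you want to compare the files by sentence, you have to prepare two lists of sentences. You can use nltk.sent_tokenize(text) to split a string into sentences.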

import difflib
import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

differ = difflib.Differ()
diff = differ.compare(nltk.sent_tokenize(text), nltk.sent_tokenize(text1))
print('\n'.join(diff))
#   Physics is one of the oldest academic disciplines.
# - Perhaps the oldest through its inclusion of astronomy.
# - Over the last two millennia.
#   Physics was a part of natural philosophy along with chemistry.
# + Quantum chemistry is a branch of chemistry.
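The question asks for only the unique sentences, so the diff output still has to be filtered. One possible sketch is below; it relies on the documented Differ prefixes ("- " for lines only in the first sequence, "+ " for lines only in the second). The helper name unique_sentences is hypothetical, and the sentence lists are written out by hand here so that the example runs without NLTK.

```python
import difflib

def unique_sentences(sentences_a, sentences_b):
    """Return (only_in_a, only_in_b) using Differ's line prefixes."""
    only_in_a, only_in_b = [], []
    for line in difflib.Differ().compare(sentences_a, sentences_b):
        if line.startswith("- "):
            only_in_a.append(line[2:])
        elif line.startswith("+ "):
            only_in_b.append(line[2:])
        # "  " marks common lines and "? " marks hint lines; both are skipped.
    return only_in_a, only_in_b

a = ["Physics is one of the oldest academic disciplines.",
     "Perhaps the oldest through its inclusion of astronomy.",
     "Over the last two millennia.",
     "Physics was a part of natural philosophy along with chemistry."]
b = ["Physics is one of the oldest academic disciplines.",
     "Physics was a part of natural philosophy along with chemistry.",
     "Quantum chemistry is a branch of chemistry."]

only_in_a, only_in_b = unique_sentences(a, b)
print(only_in_a)
print(only_in_b)
```

Note that a diff is order-sensitive: a sentence that appears in both texts but at a different position may be reported as removed and added. If order does not matter, comparing sets of sentences would be a simpler alternative.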

2016-12-01 05:22:48 DyZ

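Thanks, DyZ. Thank you for pointing out my mistake. I have one more question for you. Suppose we have the string "I am boy" and another string "I am, boy", with a comma (,) after "am". differ.compare says they are both unique and not similar because of the comma. How can we handle this case? I am working with nltk, but can I handle this case here? – Raj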

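I'm not familiar with the difflib package (but I'm happy to learn about it!), but you could also strip out any punctuation manually before running the text through the diff. Take a look at strip() for your strings, before splitting on periods. – SummerEla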

I suggest using the word tokenizer from nltk to extract the words: " ".join(w for w in nltk.word_tokenize('I am, boy') if w.isalpha()). Or you could use a regular expression to extract the words. – DyZ
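The regular-expression route mentioned above could look roughly like this. It is only a sketch: the helper name normalize is hypothetical, and the pattern keeps ASCII letters only, so it would also split contractions such as "don't" into two words.

```python
import re

def normalize(sentence):
    """Keep only runs of ASCII letters, joined by single spaces."""
    return " ".join(re.findall(r"[A-Za-z]+", sentence))

with_comma = normalize("I am, boy")
without_comma = normalize("I am boy")
print(with_comma == without_comma)
```

Applying such a normalization to every sentence before calling differ.compare makes sentences that differ only in punctuation compare as equal.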

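As DZinoviev stated above, you are passing strings to a function that expects lists. You don't need to use NLTK; you can turn your strings into lists of sentences by splitting on periods.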

import difflib

text1 = """Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry."""
text2 = """Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry."""

# Split on periods, trim surrounding whitespace, and drop empty pieces
list1 = [s.strip() for s in text1.split(".") if s.strip()]
list2 = [s.strip() for s in text2.split(".") if s.strip()]

differ = difflib.Differ()
diff = differ.compare(list1, list2)
print("\n".join(diff))

2016-12-01 05:41:27 SummerEla

Thanks, SummerEla – Raj

There may be other punctuation marks that separate sentences, such as !, ?, ..., etc. – DyZ
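If NLTK is not an option, a rough way to cover the other sentence terminators is to split with a regular expression instead of str.split("."). The sketch below is an assumption-laden approximation: the helper name split_sentences and the sample text are made up, and abbreviations such as "Dr." would be split incorrectly, which a trained tokenizer like nltk.sent_tokenize handles better.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("Is it old? Yes! Physics is old... Very old.")
print(sentences)
```

Because the split happens on the whitespace after the punctuation, each sentence keeps its own terminator, including a trailing ellipsis.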
