2016-12-01 14 views
1

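Finding unique sentences in two files

I have two files, and I am trying to print the sentences that are unique to each of them. For this I am using difflib in Python.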

import difflib

text = 'Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 = 'Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'

differ = difflib.Differ()
diff = differ.compare(text, text1)
print('\n'.join(diff))

It does not give me the output I want. It gives me this:

    P
    h 
    y 
    s 
    i 
    c 
    s 

    i 
    s 

    o 
    n 
    e 

    o 
    f 

    t 
    h 
    e 

The output I expect is just the sentences that are unique to each file:

text = Perhaps the oldest through its inclusion of astronomy. Over the last two millennia.

text1 = Quantum chemistry is a branch of chemistry.

Also, it seems that difflib.Differ works line by line rather than sentence by sentence. Any suggestions? How can I do this?

Answers

2

First, Differ().compare() compares lines, not sentences.

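Second, it actually compares sequences, such as lists of strings. However, you are passing two strings rather than two lists of strings. Since a string is also a sequence (of characters), Differ().compare() in your case compares individual characters.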
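The difference can be seen in a minimal sketch that uses only the standard library. The short strings and the variable names here are illustrative and do not come from the question:

```python
import difflib

differ = difflib.Differ()

# Two plain strings: each character is one element of the sequence.
char_diff = list(differ.compare("abc", "abd"))
# char_diff is ['  a', '  b', '- c', '+ d']

# Two lists of strings: each list item (here, a sentence) is one element.
old = ["Physics is old.", "Astronomy is older."]
new = ["Physics is old.", "Chemistry is newer."]
sentence_diff = list(differ.compare(old, new))
```

In the second call the unit of comparison is a whole sentence, which is what the question is after.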

If you want to compare the files by sentence, you must prepare two lists of sentences. You can use nltk.sent_tokenize(text) to split a string into sentences.

import nltk  # sent_tokenize also needs NLTK's Punkt tokenizer data to be installed

diff = differ.compare(nltk.sent_tokenize(text), nltk.sent_tokenize(text1))
print('\n'.join(diff))
#   Physics is one of the oldest academic disciplines.
# - Perhaps the oldest through its inclusion of astronomy.
# - Over the last two millennia.
#   Physics was a part of natural philosophy along with chemistry.
# + Quantum chemistry is a branch of chemistry.
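
Since the question asks for only the unique sentences, the diff output can be filtered by its line prefixes. This is a sketch under assumptions: the sentence lists are typed by hand rather than produced by nltk.sent_tokenize, and unique_sentences is a hypothetical helper name, not part of difflib.

```python
import difflib

def unique_sentences(sentences_a, sentences_b):
    """Return (only_in_a, only_in_b), based on Differ's '- ' and '+ ' prefixes."""
    only_a, only_b = [], []
    for line in difflib.Differ().compare(sentences_a, sentences_b):
        if line.startswith("- "):
            only_a.append(line[2:])
        elif line.startswith("+ "):
            only_b.append(line[2:])
        # Lines starting with "  " (common) or "? " (hints) are skipped.
    return only_a, only_b

a = ["Physics is one of the oldest academic disciplines.",
     "Perhaps the oldest through its inclusion of astronomy.",
     "Over the last two millennia.",
     "Physics was a part of natural philosophy along with chemistry."]
b = ["Physics is one of the oldest academic disciplines.",
     "Physics was a part of natural philosophy along with chemistry.",
     "Quantum chemistry is a branch of chemistry."]

only_a, only_b = unique_sentences(a, b)
print(only_a)
print(only_b)
```

Here only_a holds the two sentences found only in the first text and only_b holds the one sentence found only in the second.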
+0

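Thanks, DYZ, for pointing out my mistake. I have one more question. Suppose we have one string, "I am a boy", and another, "I am, a boy", which differs only by a comma (,). differ.compare reports them as two unique sentences rather than the same one, because of the comma. How can we handle this case? I am working with nltk; can I handle it there? – Raj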

+0

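I am not familiar with the difflib package (but I'm glad to learn about it!), but you could also remove any punctuation manually before running the text through the diff. Look into strip() for your strings, before splitting on periods. – SummerEla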

+0

I suggest you use the word tokenizer from nltk to extract the words: " ".join(w for w in nltk.word_tokenize('I am, a boy') if w.isalpha()). Or you can use regular expressions to extract the words. – DyZ
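
The regular-expression route mentioned in the comment above might look like the following sketch. The normalize helper is a hypothetical name, and treating `\w+` runs as words is an assumption (it would, for example, split "don't" into "don" and "t").

```python
import re

def normalize(sentence):
    """Keep only runs of word characters, joined by single spaces."""
    return " ".join(re.findall(r"\w+", sentence))

s1 = "I am a boy"
s2 = "I am, a boy"

# Compare the normalized forms instead of the raw sentences.
same = normalize(s1) == normalize(s2)
print(normalize(s2))  # I am a boy
print(same)           # True
```

Passing the normalized sentences to Differ().compare() makes the comparison ignore punctuation, at the cost of printing the normalized text rather than the original.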

1

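As DZinoviev stated above, you are passing strings to a function that expects lists. You don't need NLTK; you can turn your strings into lists of sentences by splitting on periods.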

import difflib

text1 = """Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry."""
text2 = """Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry."""

# Split on periods, trim whitespace, and drop the empty string left after the final period.
list1 = [s.strip() for s in text1.split(".") if s.strip()]
list2 = [s.strip() for s in text2.split(".") if s.strip()]

differ = difflib.Differ()
diff = differ.compare(list1, list2)
print("\n".join(diff))
+0

Thanks, SummerEla – Raj

+0

There may be other punctuation marks that end sentences, such as !, ?, ..., and so on. – DyZ
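
One way to cover those extra sentence endings without NLTK is a regular-expression split. This is a rough sketch: split_sentences is a hypothetical helper, and the rule it assumes (a sentence ends at '.', '!' or '?' followed by whitespace) will split abbreviations such as "Dr. Smith" incorrectly.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces, e.g. when the input is an empty string.
    return [p for p in parts if p]

sentences = split_sentences("Is it old? Yes! Very old... Indeed.")
print(sentences)  # ['Is it old?', 'Yes!', 'Very old...', 'Indeed.']
```

The resulting lists can be passed to Differ().compare() in place of the period-only split.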
