Python: how many similar words do two strings have?

I have some ugly strings like the following:

string1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
string2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"

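I'm looking for a library or algorithm that gives me the percentage of words the two strings have in common, ignoring special characters such as ',' and ':' and ''' and '{'.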

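I know about the Levenshtein algorithm. However, it compares the number of similar CHARACTERS, whereas I want to compare how many WORDS the strings have in common.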


2010-08-24

The Levenshtein algorithm works on any 2 sequences of comparable objects ... put another way: it works whenever a[i] == b[j] is defined and meaningful. – 2010-08-25 03:52:28

A regular expression can easily give you all the words:

import re
s1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
s2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"
# Raw strings keep the backslash in \w from being read as a string escape.
s1w = re.findall(r'\w+', s1.lower())
s2w = re.findall(r'\w+', s2.lower())

collections.Counter (Python 2.7+) can quickly count how many times each word occurs.

from collections import Counter 
s1cnt = Counter(s1w) 
s2cnt = Counter(s2w)
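As a minimal sketch of how two such counters could be turned into a percentage: the `&` operator on two Counter objects keeps each word with the smaller of its two counts. The function name `shared_word_percentage` and the 2*shared/total normalization are assumptions made for illustration, not part of the original answer.

```python
from collections import Counter
import re

def shared_word_percentage(text_a, text_b):
    """Percentage of word occurrences shared by two strings, ignoring word order."""
    count_a = Counter(re.findall(r'\w+', text_a.lower()))
    count_b = Counter(re.findall(r'\w+', text_b.lower()))
    # Counter & Counter keeps each word with the minimum of its two counts.
    shared = sum((count_a & count_b).values())
    total = sum(count_a.values()) + sum(count_b.values())
    # Assumed normalization: shared occurrences counted once per string,
    # divided by the total number of words in both strings.
    return 100.0 * 2 * shared / total if total else 0.0
```

Because it works on counts, a word repeated in both strings contributes once per matching occurrence, which a plain set intersection would not capture.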

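A very crude comparison can be done with set.intersection or difflib.SequenceMatcher, but it sounds like you want to implement a Levenshtein algorithm that works on words, in which case you can feed it these two lists.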

common = set(s1w).intersection(s2w)
# common is {'c'}

import difflib
# ratio() is 2*M/T: M matched words, T total words in both lists.
common_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
print('%.1f%% of words common.' % (100 * common_ratio))

Prints: 3.4% of words common.
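If a percentage based on the set intersection alone is wanted, one option is the Jaccard index: shared unique words divided by all unique words. This is a sketch under that assumption; `jaccard_percentage` is a hypothetical helper name and the Jaccard measure is swapped in here, it is not something the answer above prescribes.

```python
import re

def jaccard_percentage(text_a, text_b):
    """Shared unique words as a percentage of all unique words in both strings."""
    words_a = set(re.findall(r'\w+', text_a.lower()))
    words_b = set(re.findall(r'\w+', text_b.lower()))
    union = words_a | words_b
    if not union:
        return 0.0  # both strings contain no words
    return 100.0 * len(words_a & words_b) / len(union)
```

Unlike SequenceMatcher, this ignores both word order and repeated words.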


2010-08-24 16:52:15

+1, mainly for collections.Counter, another hidden gem of the stdlib. Too bad it's 2.7 only, so it may not be applicable. – delnan 2010-08-24 16:56:38

n = 0
words1 = set(sentence1.split())
for word in sentence2.split():
    # strip some chars here, e.g. as in [1]
    if word in words1:
        n += 1

([1]: How to remove symbols from a string with Python?)

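Edit: note that this counts a word as common if it appears anywhere in both sentences. To compare by position instead, skip the set conversion (just call split() on both) and use something like: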

n = 0
for word_from_1, word_from_2 in zip(sentence1.split(), sentence2.split()):
    # strip some chars here, e.g. as in [1]
    if word_from_1 == word_from_2:
        n += 1
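The "strip some chars here" step in the loops above is left open. One possible way to fill it in, as a sketch only: strip leading and trailing punctuation from each whitespace-separated piece with str.strip and string.punctuation. The helper names `clean_words` and `count_common_words` are hypothetical, and the lowercasing is an added assumption that the original loops do not make.

```python
import string

def clean_words(sentence):
    """Split on whitespace and strip punctuation from both ends of each piece."""
    pieces = (piece.strip(string.punctuation).lower() for piece in sentence.split())
    # Drop pieces that consisted only of punctuation, such as '&' or '+'.
    return [word for word in pieces if word]

def count_common_words(sentence1, sentence2):
    """Count words of sentence2 that also occur anywhere in sentence1."""
    words1 = set(clean_words(sentence1))
    return sum(1 for word in clean_words(sentence2) if word in words1)
```

Punctuation inside a piece is kept, so '{c.1690-1730}:' becomes 'c.1690-1730' rather than three separate words; the regex approach with \w+ splits such pieces instead.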


2010-08-24 16:45:48 delnan

Which library? – 2010-08-24 16:46:19

Huh? This only uses built-ins; nothing needs to be imported. – delnan 2010-08-24 16:48:18

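The Levenshtein algorithm itself is not limited to comparing characters; it can compare arbitrary objects. That the classical form uses characters is an implementation detail: the items can be any symbols or constructs that can be compared for equality.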

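In Python, convert each string into a list of words, then apply the algorithm to the lists. Maybe someone else can help you clean out the unwanted characters, presumably with some regular expression magic.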
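A minimal sketch of that idea, assuming the standard dynamic-programming formulation with unit costs for insertion, deletion and substitution. The function names `levenshtein` and `words` are illustrative, and splitting with \w+ is just one way to get the word lists.

```python
import re

def levenshtein(seq_a, seq_b):
    """Edit distance between two sequences of items comparable with ==."""
    previous = list(range(len(seq_b) + 1))  # distances from the empty prefix of seq_a
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1,      # delete item_a
                               current[j - 1] + 1,   # insert item_b
                               substitution))
        previous = current
    return previous[-1]

def words(text):
    """Lowercased words of a string, with punctuation dropped."""
    return re.findall(r'\w+', text.lower())

# Two word substitutions: quick -> slow, fox -> dog.
distance = levenshtein(words("the quick brown fox"), words("the slow brown dog"))
```

The same function still works on plain strings, since a string is just a sequence of characters. Dividing the distance by the length of the longer word list would give a rough dissimilarity fraction, though that normalization is a choice rather than part of the algorithm.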


2010-08-24 16:52:36
