Python NLTK: intersecting words and sentences

I'm using NLTK, a toolkit for working with corpus text, and I've defined a function that intersects the user's input with Shakespeare's words.

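import random
from nltk.corpus import gutenberg

def shakespeareOutput(userInput):

    # pick 3 distinct words from the user's input
    user = userInput.split()
    user = random.sample(set(user), 3)

    # here is NLTK's method: Hamlet as a list of tokenized sentences
    play = gutenberg.sents('shakespeare-hamlet.txt')

    # all lowercase
    hamlet = map(lambda sublist: map(str.lower, sublist), play)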

print hamlet returns:

[ ['[', 'the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare', '1599', ']'], 
['actus', 'primus', '.'], 
['scoena', 'prima', '.'], 
['enter', 'barnardo', 'and', 'francisco', 'two', 'centinels', '.'], 
['barnardo', '.'], 
['who', "'", 's', 'there', '?']...['finis', '.'], 
['the', 'tragedie', 'of', 'hamlet', ',', 'prince', 'of', 'denmarke', '.']]

I want to find the sentence that contains the most occurrences of the user's words and return that sentence. I tried:

    bestCount = 0
    for sent in hamlet:
        currentCount = len(set(user).intersection(sent))
        if currentCount > bestCount:
            bestCount = currentCount
            answer = ' '.join(sent)
            return ''.join(answer).lower(), bestCount

Calling the function:

shakespeareOutput("The Actus Primus")

returns:

['The', 'Actus', 'Primus'] None

What exactly am I doing wrong?

Thanks in advance.

Source

2016-06-11 data_garden

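I think the 'return' statement should not be inside the for loop. Otherwise the function returns at the first 'sent' item in the 'hamlet' list that produces a match, without checking the rest. – Rahul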
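
A minimal sketch of that point, written for Python 3 and using a hypothetical two-sentence stand-in for the 'hamlet' list so that NLTK is not needed (the function names here are made up for illustration). It also shows something that may explain the None in the question: the call passes capitalised words, while 'hamlet' is all lowercase, so nothing intersects and the function falls off the end of the loop.

```python
# Hypothetical stand-in for the lowercased `hamlet` sentence list.
hamlet = [
    ['actus', 'primus', '.'],
    ['the', 'actus', 'primus', 'of', 'the', 'play', '.'],
]

def early_return(user):
    """Mirrors the question: `return` sits inside the loop."""
    best_count = 0
    for sent in hamlet:
        current = len(set(user).intersection(sent))
        if current > best_count:
            best_count = current
            return ' '.join(sent), best_count  # exits on the first hit

def return_after_loop(user):
    """Same scoring, but returns only after every sentence was checked."""
    best_count, answer = 0, None
    for sent in hamlet:
        current = len(set(user).intersection(sent))
        if current > best_count:
            best_count, answer = current, ' '.join(sent)
    return answer, best_count

words = "The Actus Primus".split()
lowered = [w.lower() for w in words]

print(early_return(words))         # capitalised input matches nothing
print(early_return(lowered))       # stops at the first sentence with a match
print(return_after_loop(lowered))  # scans everything, keeps the best
```

Lowercasing the user's words before intersecting is an assumption about the intended behaviour; the question only lowercases the play.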

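The way you compute currentCount is wrong. The length of a set intersection is the number of distinct matching elements, not the total number of matching elements.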

>>> s = [1,1,2,3,3,4] 
>>> u = set([1,4]) 
>>> u.intersection(s) 
set([1, 4]) # len is 2, but the total number of matched elements is 3

Use the following code.

bestCount = 0

for sent in hamlet:
    currentCount = sum([sent.count(i) for i in set(user)])
    if currentCount > bestCount:
        bestCount = currentCount
        answer = ' '.join(sent)

return answer.lower(), bestCount
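
For anyone who wants to run this outside the original function, here is a rough Python 3 sketch of the same count-based scoring. The function name 'best_sentence_by_count' and the sample sentences are hypothetical. With NLTK installed, the list from gutenberg.sents('shakespeare-hamlet.txt') could presumably be passed in place of the sample data.

```python
def best_sentence_by_count(user_words, sentences):
    """Return (sentence, score), where score is the total number of tokens
    in the sentence that equal one of the user's words (case-insensitive)."""
    wanted = {w.lower() for w in user_words}
    best_count, answer = 0, None
    for sent in sentences:
        tokens = [t.lower() for t in sent]
        current = sum(tokens.count(w) for w in wanted)
        if current > best_count:
            best_count, answer = current, ' '.join(tokens)
    return answer, best_count  # answer stays None if nothing matched

# Hypothetical sample data standing in for the tokenized play.
sample = [
    ['Actus', 'Primus', '.'],
    ['The', 'king', ',', 'the', 'queen', ',', 'the', 'court', '.'],
]
result = best_sentence_by_count("The Actus Primus".split(), sample)
print(result)
```

Note that a sentence repeating one common word, such as 'the', can outscore a sentence that contains more of the different input words.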

Source

2016-06-11 06:08:31 Rahul

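Actually, the idea isn't to return a sum but the ONE sentence that is closest to the input, so len() fits the goal best. But thank you, you taught me something useful about the difference between intersection and count.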
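
To make the difference between the two scores concrete, a small Python 3 sketch on hypothetical sentences. The helper names 'distinct_matches' and 'total_matches' are invented for this example.

```python
def distinct_matches(user_words, sent):
    """Number of different user words present in the sentence."""
    return len(set(user_words).intersection(sent))

def total_matches(user_words, sent):
    """Number of tokens in the sentence that equal some user word."""
    return sum(sent.count(w) for w in set(user_words))

user = ['to', 'be', 'king']
sentences = [
    ['to', 'be', 'or', 'not', 'to', 'be'],
    ['the', 'king', 'is', 'to', 'be', 'crowned'],
]

# Pick the best sentence under each scoring rule.
closest = max(sentences, key=lambda s: distinct_matches(user, s))
most_hits = max(sentences, key=lambda s: total_matches(user, s))

print(' '.join(closest))
print(' '.join(most_hits))
```

The len()-of-intersection score favours the sentence covering more of the input words, while the count-based score favours the sentence with more repeated hits.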
