2016-01-13
1

Words missing from the NLTK vocabulary - Python

I am testing the vocabulary of the NLTK package. I used the code below and expected every check to print True:

import nltk 

english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 

print ('answered' in english_vocab) 
print ('unanswered' in english_vocab) 
print ('altered' in english_vocab) 
print ('alter' in english_vocab) 
print ('looks' in english_vocab) 
print ('look' in english_vocab) 

But my results are as follows. Are this many words really missing, or are certain forms of words left out? Am I missing something?

False 
True 
False 
True 
False 
True 

Answers

2

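In fact, a corpus is not an exhaustive list of all English words but a collection of texts. A more suitable way to decide whether a word is a valid English word is to use WordNet: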

from nltk.corpus import wordnet as wn

print(wn.synsets('answered'))
# [Synset('answer.v.01'), Synset('answer.v.02'), Synset('answer.v.03'), Synset('answer.v.04'), Synset('answer.v.05'), Synset('answer.v.06'), Synset('suffice.v.01'), Synset('answer.v.08'), Synset('answer.v.09'), Synset('answer.v.10')]

print(wn.synsets('unanswered'))
# [Synset('unanswered.s.01')]

print(wn.synsets('notaword'))
# []
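
The lookup above can be wrapped in a small predicate. This is only a minimal sketch, not part of the original answer: `is_wordnet_word` and its `lookup` parameter are hypothetical names, and the default path assumes NLTK is installed and the WordNet data has already been fetched with `nltk.download('wordnet')`.

```python
def is_wordnet_word(word, lookup=None):
    """Return True if the lookup yields at least one synset for `word`.

    `lookup` is a hypothetical injection point (any callable mapping a
    word to a list). When omitted, nltk.corpus.wordnet.synsets is used.
    """
    if lookup is None:
        # Imported lazily so the function can also be used with a stub.
        from nltk.corpus import wordnet as wn
        lookup = wn.synsets
    return len(lookup(word)) > 0
```

Going by the synset lists shown above, `is_wordnet_word('answered')` should come out true and `is_wordnet_word('notaword')` false.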
2

NLTK corpora do not actually store every word; a corpus is defined as "a large body of text".

For example, you are using the words corpus, and we can check how it is defined by calling its readme() method:

>>> print(nltk.corpus.words.readme()) 
Wordlists 

en: English, http://en.wikipedia.org/wiki/Words_(Unix) 
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932) 

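The Unix words list is not exhaustive, so it may well be missing some words. Corpora are incomplete by nature (hence the emphasis on natural language).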
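
In the question, the entries reported missing (answered, altered, looks) are all inflected forms, so one possible workaround is to retry the lookup with the word's base form. This is a sketch under assumptions: `in_vocab` and its `to_base` parameter are hypothetical names; by default it falls back to `wordnet.morphy`, which returns a base form or `None`, and that path needs the WordNet data installed.

```python
def in_vocab(word, vocab, to_base=None):
    """Check `word` against `vocab`, retrying once with its base form.

    `to_base` is a hypothetical hook: a callable that returns the base
    form of a word, or None when it cannot find one.
    """
    word = word.lower()
    if word in vocab:
        return True
    if to_base is None:
        # Imported lazily; assumes nltk.download('wordnet') was run.
        from nltk.corpus import wordnet as wn
        to_base = wn.morphy
    base = to_base(word)
    return base is not None and base in vocab
```

With the vocabulary from the question, this would be called as `in_vocab('answered', english_vocab)`.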

That said, you might want to try a corpus built from real text, such as brown:

>>> print(nltk.corpus.brown.readme()) 
BROWN CORPUS 

A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. 

by W. N. Francis and H. Kucera (1964) 
Department of Linguistics, Brown University 
Providence, Rhode Island, USA 

Revised 1971, Revised and Amplified 1979 

http://www.hit.uib.no/icame/brown/bcm.html 

Distributed with the permission of the copyright holder, redistribution permitted.
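
To repeat the question's test against the Brown corpus, something like the following could be used. It is a rough sketch: `build_vocab`, `report` and `brown_vocab` are hypothetical helper names, and `brown_vocab` assumes NLTK is installed and the corpus has been fetched with `nltk.download('brown')`.

```python
def build_vocab(tokens):
    """Lower-case and de-duplicate tokens, as the question's code does."""
    return {t.lower() for t in tokens}


def report(candidates, vocab):
    """Map each candidate word to whether it appears in `vocab`."""
    return {w: (w in vocab) for w in candidates}


def brown_vocab():
    """Build a vocabulary from every token in the Brown corpus."""
    # Imported lazily; assumes the 'brown' corpus data is available.
    from nltk.corpus import brown
    return build_vocab(brown.words())
```

Calling `report(['answered', 'unanswered', 'altered', 'alter', 'looks', 'look'], brown_vocab())` then shows which of the six words occur in Brown. A corpus of running text is still not exhaustive, so a False here only means the word does not appear in those texts.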