你並不需要一個Python包裝爲Java lib下,NLTK的得到雪球! :)
>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'
莖幹並不總是會給你確切的根源,但它是一個很好的開始。
以下是使用莖的示例。我正在構建一本stem: (word, count)
的字典,同時爲每個詞幹選擇可能的最短詞。So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}
示例代碼:(從http://en.wikipedia.org/wiki/Dance採取文本)
import re
from nltk.stem import SnowballStemmer as SS
text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""
#extract words
words = [word.lower() for word in re.findall(r'\w+',text)]
stemmer = SS('english')
counts = dict()
#count stems and extract shortest words possible
for word in words:
stem = stemmer.stem(word)
if stem in counts:
shortest,count = counts[stem]
if len(word) < len(shortest):
shortest = word
counts[stem] = (shortest,count+1)
else:
counts[stem]=(word,1)
#convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root,wordcount in counts.items()]
#trick to sort output by count (descending) & word (alphabetically)
output.sort(key=lambda x: (-x[1],x[0]))
for item in output:
print '%s:%d (Root: %s)' % item
輸出:
dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
您的特定需求,我不會推薦詞形還原:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'
子串不是一個好主意,因爲它在某個時刻總會失敗,而且很多時候都會失敗。
- 固定長度:僞單詞「dancitization」和「dancendence」將在圖4個5分別字符匹配「舞蹈」。
- 比:低比例將返回假貨(如上)
- 率:高比不匹配不夠(如「跑步」)
但隨着制止你:
>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'
這是一個令人印象深刻的不匹配結果乾「danc」。即使考慮「舞者」也不是「跳舞」,但總體準確度相當高。
我希望這可以幫助您開始。
你只能假設後綴。什麼前綴?例如。 - 不成熟,不真實,未完成等。也許你可以看看http://nltk.org/ – RedBaron