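Since you only need to compare the stem, or "root", of each word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Fortunately, there is a Python package for stemming. You can download it from here.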
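To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter2 algorithm that the stemming package implements, and `naive_stem` is a hypothetical helper invented for illustration only; a real stemmer handles far more cases.

```python
def naive_stem(word):
    """Toy suffix stripper (NOT Porter2): illustrates the idea only."""
    word = word.lower()
    # Drop a plural "s", but leave words such as "class" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    # Drop a common verb ending if a stem of at least 3 letters remains.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


print(naive_stem("Executables"))  # executable
print(naive_stem("Executable"))   # executable
print(naive_stem("running"))      # runn
```

Both "Executables" and "Executable" reduce to the same string, which is what lets the two section titles compare as equal.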
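Next, you want to compare the strings without stop words (a, an, the, from, etc.), so you need to filter these words out before comparing. You can get a stop word list from the internet, or you can import one from the nltk package. You can get nltk from here.
In case you have any problem with nltk, here is the stop word list: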
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
'should', 'now']
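As a rough sketch of the filtering step, assuming only a small subset of the list above (the names `SAMPLE_STOPWORDS` and `remove_stopwords` are made up for this example):

```python
# A small subset of the stop word list above, for illustration only.
SAMPLE_STOPWORDS = {"a", "an", "the", "from", "of"}


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Return the words of `text` that are not stop words (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stopwords]


print(remove_stopwords("Actions from a file"))  # ['Actions', 'file']
print(remove_stopwords("Actions from File"))    # ['Actions', 'File']
```

After filtering, the two titles differ only in case, which the lowercasing in the comparison takes care of.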
Now use this simple code to get the desired output:
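from stemming.porter2 import stem
from nltk.corpus import stopwords

stopwords_ = stopwords.words('english')

def addString(x):
    # Relies on a global list named `section` being defined before the call.
    flag = True
    # Reduce the candidate to stemmed, lowercased words, skipping stop words.
    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
    for i in section:
        # Normalize each existing section the same way before comparing.
        i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
        if y == i:
            flag = False  # an equivalent section already exists
            break
    if flag:
        section.append(x)
        print("\tNew Section Added")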
Demo:
>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ = stopwords.words('english')
>>>
>>> def addString(x):
...     flag = True
...     y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...     for i in section:
...         i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...         if y == i:
...             flag = False
...             break
...     if flag:
...         section.append(x)
...         print("\tNew Section Added")
...
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"] # initial Section list
>>> addString("Activity (Last 30 Days)")
New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)'] # Final section list
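If you would rather not depend on a global `section` list, one possible variation is sketched below. The names `normalize` and `add_string` are my own, and the stemmer and stop word list are passed in as parameters, so you can plug in `stem` from stemming.porter2 and the nltk list; the usage lines use `str.lower` as a stand-in stemmer purely so the sketch runs without either package.

```python
def normalize(text, stem, stopwords):
    """Stemmed, lowercased words of `text`, with stop words removed."""
    return [stem(w).lower() for w in text.split() if w.lower() not in stopwords]


def add_string(sections, text, stem, stopwords):
    """Append `text` to `sections` unless an equivalent entry exists.

    Returns True if the text was added, False if it was a duplicate.
    """
    key = normalize(text, stem, stopwords)
    if any(normalize(s, stem, stopwords) == key for s in sections):
        return False
    sections.append(text)
    return True


# Stand-in arguments for illustration; swap in a real stemmer and list.
sections = ["Actions from File"]
print(add_string(sections, "Actions from a file", str.lower, {"a", "from"}))      # False
print(add_string(sections, "Activity (Last 30 Days)", str.lower, {"a", "from"}))  # True
```

Returning a boolean instead of printing leaves it to the caller to decide how to report a newly added section.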
Thanks, this sounds interesting, but I can't access the stop word list. It gives me this error:
Resource u'corpora/stopwords' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download() Searched in: - 'C:\\Users\\Unnati_Shukla/nltk_data' - 'C:\\nltk_data' - 'D:\\nltk_data' - 'E:\\nltk_data' - 'C:\\Python27\\nltk_data' - 'C:\\Python27\\lib\\nltk_data' - 'C:\\Users\\Unnati_Shukla\\AppData\\Roaming\\nltk_data'
I have uploaded the stop word list returned by `nltk.stopwords`. Use it directly, and remove the lines `from nltk.corpus import stopwords` and `stopwords_ = stopwords.words('english')` from the code. Assign `stopwords_` to the list I uploaded...
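One way to combine both approaches, as a sketch: try nltk first and fall back to a hard-coded list if the package or the corpus is missing. `FALLBACK_STOPWORDS` is a name I made up, and it is abbreviated here; in practice you would paste in the full list from the answer. Alternatively, running `nltk.download('stopwords')` once should fetch the missing corpus.

```python
# Abbreviated fallback for illustration; paste the full list from the answer.
FALLBACK_STOPWORDS = ['i', 'me', 'my', 'a', 'an', 'the', 'from', 'of', 'in', 'on']

try:
    from nltk.corpus import stopwords
    stopwords_ = stopwords.words('english')
except (ImportError, LookupError):
    # ImportError: nltk is not installed.
    # LookupError: nltk is installed but the stopwords corpus was not found.
    stopwords_ = FALLBACK_STOPWORDS
```

The rest of the code can keep using `stopwords_` unchanged either way.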