Removing stop words from a file

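I want to remove stop words from the Data column in my file. I kept only the rows where the end user is speaking, but usertext.apply(lambda x: [word for word in x if word not in stop_words]) does not filter out the stop words. What am I doing wrong?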

import pandas as pd 
from stop_words import get_stop_words 
df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1") 
usertext = df[df.Role.str.contains("End-user",na=False)][['Data','chatid']] 
stop_words = get_stop_words('dutch') 
clean = usertext.apply(lambda x: [word for word in x if word not in stop_words]) 
print(clean)


2017-03-08 DataNewB

First, can you 1) print 'stop_words', and 2) try 'clean = usertext.apply(lambda x: [])' to see whether it removes all the words? (Just as a test.) –

Data [] chatid [] dtype: object ['aan', 'al', 'alles', 'als', 'altijd', 'andere', 'ben', 'bij', 'daar', 'dan', 'dat', 'de', 'der', 'deze', 'die', 'dit', 'doch', 'doen', 'door', 'een', 'eens', 'en', 'er', 'ge', 'geen', 'geweest', 'haar', 'had', 'heb', 'hebben', 'heeft', 'het', 'hier', 'hij', 'hoe', 'hun', 'iemand', 'iets', 'ik', 'in', 'is', 'ja', 'je', 'kan', 'kon', 'kunnen', 'maar', 'me', 'meer', 'men', 'met', 'mij', 'mijn', 'moet', 'na', 'naar', 'niet', 'niets', 'nog', 'nu', 'of', 'om', 'omdat', ...] This is the output. – DataNewB

clean = usertext.apply(lambda x: x if x not in stop_words else '')


2017-03-08 14:40:22 galaxyan

If you can, I'd suggest using 'set' on 'stop_words' to improve efficiency. –

I get NameError: ("name 'word' is not defined", 'occurred at index Data') when I run it. – DataNewB

@DataNewB Sorry, it should be x. – galaxyan

You can build a regular expression from your stop words and call the vectorized str.replace to remove them:

In [124]: 
stop_words = ['a','not','the'] 
stop_words_pat = '|'.join(['\\b' + stop + '\\b' for stop in stop_words]) 
stop_words_pat 

Out[124]: 
'\\ba\\b|\\bnot\\b|\\bthe\\b' 

In [125]:  
df = pd.DataFrame({'text':['a to the b', 'the knot ace a']}) 
df['text'].str.replace(stop_words_pat, '', regex=True)  # regex=True is needed on pandas >= 2.0, where the default is a literal match

Out[125]: 
0   to b 
1  knot ace 
Name: text, dtype: object

Here we use a list comprehension to wrap each stop word in '\b', which matches a word boundary, and then we OR all the words together using '|'.
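As a variation on the pattern above, here is a minimal sketch that uses only the standard re module. It assumes the stop words might contain regex metacharacters, so each one goes through re.escape, and it collapses the extra spaces that the removal leaves behind. The helper name strip_stop_words is hypothetical, not part of any library.

```python
import re

stop_words = ['a', 'not', 'the']

# Escape each word, then wrap the whole alternation in one pair of \b anchors.
stop_words_pat = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\b'
)

def strip_stop_words(text):
    """Remove stop words, then collapse the whitespace they leave behind."""
    without_stops = stop_words_pat.sub('', text)
    return ' '.join(without_stops.split())

print(strip_stop_words('a to the b'))      # to b
print(strip_stop_words('the knot ace a'))  # knot ace
```

The same function could be passed to df['text'].apply(strip_stop_words) if the cleanup is needed per row.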


2017-03-08 14:55:42 EdChum

Two issues:

First, you have a module named stop_words, and later you create a variable that is also named stop_words. This is bad form.

Second, you are passing .apply a lambda that expects its x argument to be a list, rather than a single value from the list.

That is, instead of doing df.apply(sqrt), you are doing df.apply(lambda x: [sqrt(val) for val in x]).
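To see why nothing gets filtered, here is a small sketch in which a plain Python list stands in for the Data column. The sample sentences and the short stop-word list are made up for illustration. Each element the comprehension sees is a whole sentence, and a whole sentence is never equal to a single stop word.

```python
stop_words = ['de', 'een', 'ik', 'niet']

# Hypothetical stand-in for the 'Data' column: one full sentence per cell.
column = ['ik heb een vraag', 'de printer werkt niet']

# Same shape as the lambda in the question: the loop variable is a whole cell.
unchanged = [cell for cell in column if cell not in stop_words]
print(unchanged == column)  # True: no cell equals a stop word

# Splitting each cell first makes the loop variable an individual word.
filtered = [
    [word for word in cell.split() if word not in stop_words]
    for cell in column
]
print(filtered)  # [['heb', 'vraag'], ['printer', 'werkt']]
```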

You should either do the list processing yourself:

clean = [x for x in usertext if x not in stop_words]

Or you should call apply with a function that takes just one word at a time:

clean = usertext.apply(lambda x: x if x not in stop_words else '')

As @Jean-François Fabre suggested in the comments, you can speed things up if your stop_words is a set rather than a list:

from stop_words import get_stop_words 

nl_stop_words = set(get_stop_words('dutch')) # NOTE: set 

usertext = ... 
clean = usertext.apply(lambda word: word if word not in nl_stop_words else '')
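If each cell in the Data column holds a full sentence rather than a single word (an assumption based on the question, since the data itself is not shown), the membership test has to run per word inside each cell. A minimal sketch follows, with a hypothetical helper remove_stop_words and a small hard-coded set standing in for set(get_stop_words('dutch')):

```python
# Stand-in for set(get_stop_words('dutch')); set lookups are O(1) on average.
nl_stop_words = {'de', 'een', 'ik', 'niet', 'het'}

def remove_stop_words(text, stop_words=nl_stop_words):
    """Return text with stop words removed, comparing case-insensitively."""
    # str() avoids an AttributeError on non-string cells such as NaN
    # (such a cell is treated as the literal text 'nan').
    kept = [word for word in str(text).split() if word.lower() not in stop_words]
    return ' '.join(kept)

print(remove_stop_words('Ik heb een vraag over het account'))
# heb vraag over account
```

With pandas this would be applied to the text column only, for example usertext['Data'].apply(remove_stop_words), which leaves chatid untouched.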


2017-03-08 15:10:39
