Removing stop words from tweets in Python

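I am trying to remove stop words from tweets that I imported from Twitter. After the stop words are removed, the resulting list of strings should be placed in a new column on the same row. I can easily do this for one row at a time, but my attempt to loop the method over the whole data frame has not been successful.

How can I do this?

An excerpt of my data: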

tweets['text'][0:5] 
Out[21]: 
0 Why #litecoin will go over 50 USD soon ? So ma... 
1 get 20 free #bitcoin spins at... 
2 Are you Bullish or Bearish on #BMW? Start #Tra... 
3 Are you Bullish or Bearish on the S&amp;P 500?... 
4 TIL that there is a DAO ExtraBalance Refund. M...

The following works for a single row:

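from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # needed for word_tokenize below

stop_words = set(stopwords.words('english'))
tweets['text-filtered'] = ""  # new column that will hold the filtered tokens

# Tokenize one tweet and keep only the tokens that are not stop words
word_tokens = word_tokenize(tweets['text'][1])
filtered_sentence = [w for w in word_tokens if w not in stop_words]
tweets['text-filtered'][1] = filtered_sentence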

tweets['text-filtered'][1] 
Out[22]: 
['get', 
'20', 
'free', 
'#', 
'bitcoin', 
'spins', 
'withdraw', 
'free', 
'#', 
'btc', 
'#', 
'freespins', 
'#', 
'nodeposit', 
'#', 
'casino', 
'#', 
'...', 
':']

My attempt at doing this in a loop was not successful:

for i in tweets: 
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False)) 
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][i] = filtered_sentence

A snippet of the traceback:

Traceback (most recent call last): 

    File "<ipython-input-23-6d7dace7a2d0>", line 2, in <module> 
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False)) 

... 

KeyError: 'id'

Based on @Prune's reply, I managed to correct my mistake. Here is a possible solution:

count = 0  
for i in tweets['text']: 
    word_tokens = word_tokenize(i) 
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][count] = filtered_sentence 
    count += 1
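
For what it's worth, the same result can probably be had without a manual counter by applying a function to the text column. This is only a sketch under stated assumptions: the tiny stop_words set and the whitespace tokenize helper are hypothetical stand-ins for set(stopwords.words('english')) and word_tokenize so that it runs without downloading NLTK data, the sample rows are made up, and the lowercasing before the lookup is my addition rather than something the loop above does.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english'))
stop_words = {"at", "are", "you", "or", "on", "the"}


def tokenize(text):
    """Hypothetical stand-in for nltk's word_tokenize: split on whitespace."""
    return text.split()


def remove_stop_words(text):
    """Return the tokens of text that are not stop words (case-insensitive)."""
    return [w for w in tokenize(text) if w.lower() not in stop_words]


# Made-up sample rows, loosely modelled on the excerpt in the question
tweets = pd.DataFrame({
    "id": [101, 102],
    "text": [
        "get 20 free #bitcoin spins at",
        "Are you Bullish or Bearish on #BMW?",
    ],
})

# apply calls the function once per value in the column and returns a new
# Series, so each row's token list lands in the new column on the same row.
tweets["text-filtered"] = tweets["text"].apply(remove_stop_words)

print(tweets["text-filtered"].tolist())
```

Swapping the real word_tokenize and NLTK stop word list back in should only require changing the two stand-ins.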

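My earlier attempt was looping over the columns of the data frame, tweets, rather than over its rows. The first column in tweets is 'id'.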

tweets.columns 
Out[30]: 
Index(['id', 'user_bg_color', 'created', 'geo', 'user_created', 'text', 
     'polarity', 'user_followers', 'user_location', 'retweet_count', 
     'id_str', 'user_name', 'subjectivity', 'coordinates', 
     'user_description', 'text-filtered'], 
     dtype='object')
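
To illustrate the difference, here is a small sketch with a made-up two-row data frame (the values are invented for the example). It shows what each kind of loop actually yields.

```python
import pandas as pd

# Made-up data frame with the same first column name as in the question
tweets = pd.DataFrame({
    "id": [101, 102],
    "text": ["first tweet", "second tweet"],
})

# Iterating over the data frame itself yields column labels, not rows
labels = [i for i in tweets]
print(labels)

# Iterating over a single column yields that column's values
texts = [t for t in tweets["text"]]
print(texts)

# Series.items() yields (index label, value) pairs, which is handy when
# the row label is needed to write a result back
pairs = list(tweets["text"].items())
print(pairs)
```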

Source

2017-05-31 Kevin

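When you get to a solution, please remember to up-vote what was useful and accept your favourite answer (even if you have to write it yourself), so Stack Overflow can properly archive the question. – Prune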

You have confused your indexing:

for i in tweets: 
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False)) 
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][i] = filtered_sentence

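Note that tweets behaves like a dictionary here, and tweets['text'] is a list of strings. So for i in tweets yields the keys of tweets (the dictionary keys, which for a data frame are its column labels), not row positions. It looks like 'id' is the first one returned. When you then index with it, as in tweets['text'][i] or tweets['text-filtered']['id'] = filtered_sentence, there is no such element.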
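
The same thing can be seen with a plain dictionary of lists. This is just a sketch with invented data, not your actual tweets object: a plain list raises TypeError where the traceback in the question shows KeyError: 'id', but the cause is the same, namely a key of the outer container being used as an index into the inner one.

```python
# Invented data: a dictionary whose values are lists, one entry per tweet
tweets = {
    "id": [101, 102],
    "text": ["first tweet", "second tweet"],
}

# Looping over the dictionary yields its keys
keys = [i for i in tweets]
print(keys)

# Using one of those keys to index the inner list fails
try:
    tweets["text"]["id"]
except TypeError as exc:
    error = type(exc).__name__
    print(error, exc)

# Looping over the inner list with enumerate gives a position and the text
positions = []
for pos, text in enumerate(tweets["text"]):
    positions.append(pos)
    print(pos, text)
```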

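Try coding more gently: start from the inside, write a few lines at a time, and work your way out to the more complex control structures. Debug each addition before you continue. Here, you seem to have lost track of what is a numeric index, what is a list, and what is a dictionary.

Since you haven't done any visible debugging or provided the context, I can't fix the whole program for you, but this should get you started.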

Source

2017-05-31 23:35:13 Prune

The confusion between indices, lists and dictionaries was exactly the problem! I have updated my answer based on your suggestions – Kevin
