Removing stop words with NLTK in Python

I am using NLTK to remove stop words from the elements of a list. Here is my code snippet:

dict1 = {}
for ctr, row in enumerate(cur.fetchall()):
    list1 = [row[0], row[1], row[2], row[3], row[4]]
    dict1[row[0]] = list1
    print ctr+1,"\n",dict1[row[0]][2]
    list2 = [w for w in dict1[row[0]][3] if not w in stopwords.words('english')]
    print list2

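The problem is that this removes not only stop words but also characters from other words. For example, the character 'i' is removed from the word 'Orientation', and so are other characters that happen to be stop words. On top of that, list2 ends up storing characters instead of words, i.e. ['O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', '3', ' ', 'r', 'e', 'r', 'e', ' ', 'p', 'n', '\n', '\n', '\n', 'O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', 'n', ' ', 'r', 'e', 'r', 'e', ' ', 'r', 'p', 'l', ...], whereas I want it to be ['Orientation', ...].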

2016-07-08 Yash Goel

Try tokenizing your text into words first – galaxyan

What does the data in your code look like? Can you post more of the surrounding code? –

First, make sure list1 is a list of words rather than an array of characters. Here is a code snippet you can adapt.

from nltk import word_tokenize 
from nltk.corpus import stopwords 

english_stopwords = stopwords.words('english') # get english stop words 

# test document 
document = '''A moody child and wildly wise 
Pursued the game with joyful eyes 
''' 

# first tokenize your document to a list of words 
words = word_tokenize(document) 
print(words) 

# then remove all stop words (compare in lower case, since the list is lower-case)
content = [w for w in words if w.lower() not in english_stopwords] 
print(content)

The output will be:

['A', 'moody', 'child', 'and', 'wildly', 'wise', 'Pursued', 'the', 'game', 'with', 'joyful', 'eyes'] 
['moody', 'child', 'wildly', 'wise', 'Pursued', 'game', 'joyful', 'eyes']
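One thing worth keeping in mind with this approach: word_tokenize generally splits punctuation off into separate tokens, and those tokens are not stop words, so they pass through the filter. The sketch below is only an illustration of how they could be dropped as well. The token list is written out by hand (it is what I would expect word_tokenize to return for "Hello, this is row3."), and `stand_in_stopwords` and `drop_stopwords_and_punct` are hypothetical names, with a tiny stand-in set so the example runs without any NLTK data.

```python
import string

# Hand-written tokens, assumed to match typical word_tokenize output
# for the sentence "Hello, this is row3."
tokens = ['Hello', ',', 'this', 'is', 'row3', '.']

# Tiny stand-in for stopwords.words('english'); an assumption for this sketch
stand_in_stopwords = {'this', 'is', 'a', 'and', 'the', 'with'}


def drop_stopwords_and_punct(tokens, stop_set):
    """Keep tokens that are neither stop words nor made up only of punctuation."""
    kept = []
    for token in tokens:
        is_stopword = token.lower() in stop_set
        is_punct = all(ch in string.punctuation for ch in token)
        if not is_stopword and not is_punct:
            kept.append(token)
    return kept


content = drop_stopwords_and_punct(tokens, stand_in_stopwords)
print(content)
```

Using a set for the stop words also keeps each membership test cheap, which matters once the documents get long.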

2016-07-08 20:14:09

First, the way you build list1 looks a bit odd to me. I think there is a more Pythonic solution:

list1 = row[:5]

Next, is there a reason you access row[3] through dict1[row[0]][3] rather than using row[3] directly?

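Finally, assuming row is a list of strings, building list2 from row[3] iterates over each character rather than each word. That is probably why you are filtering out 'i' and 'a' (and a few other characters).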
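To make the difference concrete, here is a minimal sketch of both loops. It uses a small stand-in set instead of the real NLTK list so it runs anywhere; `stand_in_stopwords` and the sample text are made up for illustration, although the single letters in it ('i', 'a', 't', 'o') do appear in NLTK's English stop word list.

```python
# Stand-in for stopwords.words('english'); an assumption for this sketch
stand_in_stopwords = {'i', 'a', 't', 'o', 'the', 'of'}

text = 'Orientation of the robot'

# Iterating over a string yields single characters, so every character
# that happens to be a stop word is dropped from inside the words.
chars_kept = [w for w in text if w not in stand_in_stopwords]

# Splitting first yields whole words, so only whole stop words are dropped.
words_kept = [w for w in text.split() if w.lower() not in stand_in_stopwords]

print(chars_kept[:5])
print(words_kept)
```

The first list starts with the same 'O', 'r', 'e', 'n', 'n' fragments shown in the question, while the second keeps 'Orientation' intact.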

The correct comprehension should be:

list2 = [w for w in row[3].split(' ') if w not in stopwords]

You have to split your string in some way, presumably on spaces. That takes something like:

'Hello, this is row3'

to

['Hello,', 'this', 'is', 'row3']

so that iterating over it gives you whole words rather than single characters.

2016-07-08 20:28:41 dashiell

TypeError: argument of type 'LazyCorpusLoader' is not iterable –
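The error reported in this comment is consistent with testing membership against the `stopwords` corpus object itself (as in `if w not in stopwords`) instead of against the word list returned by `stopwords.words('english')`. A minimal sketch of one way around it follows. The helper names `load_english_stopwords` and `remove_stopwords` are hypothetical, and the fallback set is only a stand-in so the example still runs when NLTK or its 'stopwords' corpus is not installed (normally fetched with `nltk.download('stopwords')`).

```python
def load_english_stopwords():
    """Return the English stop words as a set for fast membership tests."""
    try:
        from nltk.corpus import stopwords
        # Call .words() to get an actual list; the corpus object is not iterable
        return set(stopwords.words('english'))
    except (ImportError, LookupError):
        # NLTK or the 'stopwords' corpus is missing: tiny stand-in list (assumption)
        return {'a', 'an', 'and', 'the', 'is', 'this', 'with', 'i'}


def remove_stopwords(text, stop_set):
    """Split on whitespace and drop whole words that are stop words."""
    return [w for w in text.split() if w.lower() not in stop_set]


english_stopwords = load_english_stopwords()
result = remove_stopwords('Hello, this is row3', english_stopwords)
print(result)
```

Loading the list once outside the loop also avoids calling `stopwords.words('english')` again for every element, as the code in the question does.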
