2017-04-06 44 views

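Error when extracting noun phrases from a training corpus and removing stop words using NLTK

I am new to both Python and NLTK. I have to extract noun phrases from a corpus and then remove the stop words using NLTK. I have written my code, but I still get an error. Can anyone help me fix this problem? Recommendations for a better solution are also welcome. Thanks.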

import nltk 
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords 

docid='19509' 
title='Example noun-phrase and stop words' 
print('Document id:'),docid 
print('Title:'),title 

#list noun phrase 
content='This is a sample sentence, showing off the stop words filtration.' 
is_noun = lambda pos: pos[:2] == 'NN' 
tokenized = nltk.word_tokenize(content) 
nouns = [word for (word,pos) in nltk.pos_tag(tokenized) if is_noun(pos)] 
print('All Noun Phrase:'),nouns 

#remove stop words 
stop_words = set(stopwords.words("english")) 

example_words = word_tokenize(nouns) 
filtered_sentence = [] 

for w in example_words: 
    if w not in stop_words: 
        filtered_sentence.append(w) 

print('Without stop words:'),filtered_sentence 

And I get the following error:

Traceback (most recent call last):
  File "C:\Users\User\Desktop\NLP\stop_word.py", line 20, in <module>
    example_words = word_tokenize(nouns)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text,realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Can you explain what exactly the error is? Which part is not working? – christinabo


There are too many errors for me to understand. I have attached the error output above, @christinabo – Nur


Possible duplicate of http://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python ? – alvas

Answers

1

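You get this error because the function word_tokenize expects a single string as its argument, and you are passing it a list of strings (nouns). As far as I understand what you want to achieve, you do not need to tokenize again at this point. By the line print('All Noun Phrase:'),nouns you already have all the nouns of your sentence. To remove the stop words, you can use: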

### remove stop words ### 
stop_words = set(stopwords.words("english")) 
# find the nouns that are not in the stopwords 
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words] 
# your sentence is now clear 
print('Without stop words:',nouns_without_stopwords) 

Of course, in this case you get the same result as nouns, because none of the nouns is a stop word.
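As a side note, the entries in NLTK's English stop-word list are all lowercase, so a capitalised token such as "This" would slip through an exact comparison. Here is a minimal sketch of a case-insensitive filter. To keep it runnable without any NLTK data, it uses a small hand-written set as a stand-in for stopwords.words("english"); the helper name remove_stop_words and the noun list are assumptions made for illustration, not actual tagger output.

```python
# Hypothetical stand-in for set(stopwords.words("english")); the real list is much longer.
STOP_WORDS = {"this", "is", "a", "the", "off", "it", "other"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words whose lowercase form is not in stop_words."""
    return [word for word in words if word.lower() not in stop_words]


# Illustrative noun list; what nltk.pos_tag actually returns may differ.
nouns = ["sample", "sentence", "words", "filtration"]

# None of these nouns is a stop word, so the list comes back unchanged.
print(remove_stop_words(nouns))

# Capitalised stop words are caught because of the .lower() call.
print(remove_stop_words(["This", "Other", "sentence"]))
```

With the real NLTK list you would pass set(stopwords.words("english")) as the second argument instead of the stand-in set.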

I hope this helps.
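For what it's worth, the TypeError at the bottom of the traceback can be reproduced without NLTK. The last frame calls finditer on a compiled regular expression, and the re module only accepts a string there, not a list. A small sketch follows; the pattern and the helper name count_sentence_ends are made up for illustration and are not NLTK's actual code.

```python
import re

# Made-up pattern for illustration; NLTK's Punkt tokenizer uses a far more elaborate one.
SENTENCE_END = re.compile(r"[.!?]")


def count_sentence_ends(text):
    """Count sentence-ending punctuation marks; text must be a string."""
    return sum(1 for _ in SENTENCE_END.finditer(text))


# A string works as expected.
print(count_sentence_ends("A sample sentence. Another one!"))

# A list (like `nouns` in the question) makes the regex engine raise TypeError.
try:
    count_sentence_ends(["sample", "sentence"])
except TypeError as exc:
    print("TypeError:", exc)
```

If you ever do need to tokenize the contents of a list, one option is to join it into a single string first, for example " ".join(nouns), but for removing stop words from a list of nouns that step is unnecessary.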


Yes, it works. Thanks a lot for helping me, @christinabo – Nur
