2016-09-09 35 views

Python - NLTK splitting off punctuation

I'm very new to Python, and I'm trying to use NLTK to remove the stop words from my file. The code works, but it splits off punctuation: if my text is a tweet containing a mention (@user), I end up with "@ user". Later I need to build a word-frequency count, so I need mentions and hashtags to stay intact. My code:

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import codecs 

stop_words = set(stopwords.words("portuguese"))   # build the set once, not per line
arquivo = open('newfile.txt', encoding="utf8") 
fp = codecs.open("stopwords.txt", "a", "utf-8")   # open the output once, not per line

linha = arquivo.readline() 
while linha: 
    word_tokens = word_tokenize(linha) 
    # keep only the tokens that are not Portuguese stop words
    filtered_sentence = [w for w in word_tokens if w not in stop_words] 
    for word in filtered_sentence: 
        fp.write(word + " ") 
    fp.write("\n") 
    linha = arquivo.readline() 

fp.close() 
arquivo.close() 

EDIT: Not sure this is the best way to do it, but I fixed it like this:

import string 

for words in filtered_sentence: 
    fp.write(words) 
    # only add a trailing space after non-punctuation tokens,
    # so "@" "user" glues back together as "@user"
    if words not in string.punctuation: 
        fp.write(" ") 
fp.write("\n") 
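One caveat if you later want to drop punctuation tokens entirely before counting word frequencies: `w in string.punctuation` is a substring test, so it only matches single-character tokens like "." or "@"; a multi-character token such as "..." does not match. A stricter per-character check could look like this (the `is_punct` helper name is mine):

```python
import string

def is_punct(token):
    # True only if the token is non-empty and made entirely of punctuation
    return bool(token) and all(ch in string.punctuation for ch in token)

print(is_punct("@"))      # True
print(is_punct("..."))    # True
print(is_punct("@user"))  # False
```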

Answer


Instead of word_tokenize, you can use the Twitter-aware tokenizer provided by nltk:

from nltk.tokenize import TweetTokenizer 

... 
tknzr = TweetTokenizer() 
... 
word_tokens = tknzr.tokenize(linha) 
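With TweetTokenizer, mentions and hashtags survive as single tokens instead of being split at the punctuation (the sample sentence below is my own):

```python
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

# mentions and hashtags are kept as single tokens
tokens = tknzr.tokenize("@user bom dia #python")
print(tokens)  # ['@user', 'bom', 'dia', '#python']
```

TweetTokenizer is rule-based, so unlike word_tokenize it does not require any downloaded NLTK data.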

That's much better, thank you very much – urukh
