Reading a csv file, removing stop words, finding unique words

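I want to read in a csv file that contains 3 million tweets. Ultimately, I want to remove the stop words and get the 2,000 most frequent unique words along with their frequencies. But before I even get to that point, I am running into an error. Here is my code: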

import nltk 
from nltk.corpus import stopwords 
import csv 

f = open("/Users/shannonmcgregor/Desktop/ShanTweets.csv") 
shannon_sample_tweets = f.read() 
f.close() 

filtered_tweets = [w for w in shannon_sample_tweets if not w in stopwords.words('english')]

And the error I get after running it is:

__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Can anyone help me figure out what is going on? I did put `# -*- coding: utf-8 -*-` at the top of my source code.


2014-12-03 shannimcg

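Python gets angry when you try to compare unicode strings with non-unicode strings. It would help to test the types of your csv words and your stoplist words. To do that, try `for x in shannon_sample_tweets: print type(x)` and `for y in stopwords.words('english'): print type(y)`. Running those lines will tell you whether either or both of them are in unicode. Once you know which one is not unicode, you can bring that string into unicode with `unicode(string_thats_not_in_unicode)`. I hope this helps! – duhaime 2014-12-04 01:16:38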

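Thanks @duhaime - the code you wrote tells me that shannon_sample_tweets is '' and the stop words are '', but when I run the command `unicode(shannon_sample_tweets)`, I get the following error: `>>> unicode(shannon_sample_tweets) Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 100: ordinal not in range(128)` – shannimcg 2014-12-04 02:07:07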

OK, your comment cleared a few things up. To get your CSV into unicode, you should run `import codecs` and then:

f = codecs.open("/Users/shannonmcgregor/Desktop/ShanTweets.csv","r","utf-8")

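Then, if you re-check the type of the CSV contents, you should see unicode. This of course assumes your tweets are valid utf-8, which appears to be the case (I only took a quick look!). If you plan to work with strings in Python, I recommend reading up on encodings - they will become important to your work.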
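A minimal sketch of the same idea using `io.open`, which accepts an `encoding` argument on both Python 2 and Python 3. The temporary file and its contents are made up for illustration; the real CSV path would take their place.

```python
import io
import os
import tempfile

# Hypothetical sample data standing in for the real tweet file.
sample = u"id,text\n1,caf\u00e9 is open\n"

# Write the sample to a temporary file as UTF-8 bytes.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-8"))

# Decode while reading, so the program only ever handles Unicode text.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)

# The decoded text matches the original Unicode string.
print(text == sample)
```

If the file is not actually UTF-8, the `f.read()` call raises `UnicodeDecodeError`, which is a useful early signal that a different `encoding` value is needed.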


2014-12-04 14:44:59 duhaime

Thanks - I actually had to change this to Latin-1 (instead of utf-8) to finally get it to work. – shannimcg 2014-12-05 15:43:20
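The switch to Latin-1 is consistent with the earlier traceback. In UTF-8, the byte 0xd5 is only valid as the start of a two-byte sequence, whereas Latin-1 maps every byte value to a character and so never raises. That also means a successful Latin-1 decode does not prove the file really is Latin-1. A small sketch with a made-up byte string (the `mac_roman` line is purely an illustration of how another single-byte encoding reads the same byte, not a claim about the actual file):

```python
# Made-up bytes for illustration: "it", the byte 0xd5, then "s".
raw = b"it\xd5s"

# UTF-8 rejects this: 0xd5 opens a two-byte sequence, but "s" is not a
# valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 assigns a character to each of the 256 byte values, so decoding
# cannot fail, whether or not the result is the intended text.
as_latin1 = raw.decode("latin-1")

# A different single-byte encoding yields a different character.
as_mac_roman = raw.decode("mac_roman")

print(utf8_ok)
print(repr(as_latin1))
print(repr(as_mac_roman))
```

Checking a few decoded tweets by eye is a reasonable way to confirm that the chosen encoding produces the intended characters.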

Awesome! Glad to help. – duhaime 2014-12-05 17:52:29
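As for the original goal (the 2,000 most frequent non-stop words), note that iterating over the string returned by `f.read()` yields single characters rather than words, so the text needs to be split into tokens first. Below is a rough sketch for Python 3 using `collections.Counter`. The sample rows, the `text` column name and the tiny stop word set are all placeholders; with NLTK installed, `set(stopwords.words('english'))` could be used instead, and building that set once avoids rebuilding the list for every token.

```python
import csv
import io
import re
from collections import Counter

# Placeholder data; in practice this would be the decoded CSV contents.
csv_text = u"id,text\n1,The cat sat on the mat\n2,A cat is not a dog\n"

# Placeholder stop word set (an assumption, far shorter than NLTK's list).
stop_words = {"the", "a", "on", "is", "not"}

counts = Counter()
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    # Lowercase the tweet and pull out runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", row["text"].lower())
    # Count only the tokens that are not stop words.
    counts.update(t for t in tokens if t not in stop_words)

# Up to 2,000 (word, frequency) pairs, most frequent first.
top_words = counts.most_common(2000)
print(top_words)
```

For a file with millions of rows, passing an open file object to `csv.DictReader` instead of an in-memory string keeps memory use low, since rows are then read one at a time.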
