Decoding/encoding with sklearn's load_files

I'm following the tutorial at https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb to learn about machine learning and text.

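In my case, I'm using tweets that I downloaded, with positive and negative tweets arranged in exactly the same directory structure the tutorial uses (I'm trying to learn sentiment analysis).

Here, in an IPython notebook, I load my data just like they do: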

from sklearn.datasets import load_files
tweets_train = load_files('Path to my training Tweets')

Then I try to fit a CountVectorizer to them:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(text_train)

I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte

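Is this because my tweets contain all sorts of non-standard characters? I haven't done any cleanup of my tweets (I assume there are libraries that help with that, so that a bag of words can work?).

EDIT: Here is the code I used to download the tweets with Twython: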

from twython import Twython

# CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY and ACCESS_SECRET are defined elsewhere.
def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  ## iterate through all tweets
        ## tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(
            screen_name=user, count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  ## append tweet id's
            text = str(tweet['text']).replace("'", "")
            # No encoding is passed here, so the system default is used.
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()

Source

2017-05-26 Amanda_Panda

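This means that you either stored the data using an encoding other than UTF-8, or the data got corrupted somehow. Please provide details (= code) on how you downloaded the tweets and saved them to disk. – lenz

See the edit for the code that downloads the tweets. –

Could you also show how you get from 'tweets_train' to 'text_train'? – lenz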

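You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding. If that doesn't mean anything to you, make sure you understand the basics of Unicode and text encodings, e.g. with the official Python Unicode HOWTO.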
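As a minimal illustration of such a mismatch (the sample string is made up and is not taken from the question's data), text saved as Latin-1 fails to decode as UTF-8 as soon as it contains a non-ASCII character, and the error has the same shape as the one in the question:

```python
# Hypothetical sample text; "Ø" is the single byte 0xd8 in Latin-1.
raw = "Ørsted".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # In UTF-8, 0xd8 announces a two-byte sequence, but the next byte
    # ("r") is not a continuation byte, so decoding fails.
    message = str(err)
    print(message)

# Decoding with the encoding that was actually used works.
text = raw.decode("latin-1")
print(text)
```

The bytes on disk are not damaged in this case; they are only being read with the wrong codec.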

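First, you need to find out which encoding was used to store the tweets on disk. When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. You can check this, for example, in an interactive session: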

>>> f = open('/tmp/foo', 'a') 
>>> f 
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>

Here you can see that the default encoding in my local environment is set to UTF-8. You can also check the default encoding directly:

>>> import locale
>>> locale.getpreferredencoding(False)  # what open() uses when no encoding is given
'UTF-8'

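There are other ways to find out which encoding was used for a file. For example, if you happen to be working on a Unix platform, the Unix tool file is quite good at guessing the encoding of existing files.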
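If no such tool is at hand, a rough check can be sketched in Python by trying candidate encodings in turn. The helper name guess_encoding and the candidate list below are assumptions made for illustration, not part of any library. Note that Latin-1 can decode any byte sequence, so it only makes sense as a last fallback: getting it back tells you that UTF-8 was ruled out, not that Latin-1 is certainly correct.

```python
def guess_encoding(data, candidates=("utf-8", "latin-1")):
    """Return the first candidate that decodes `data` without error, else None.

    Hypothetical helper: `data` is the raw bytes of a file, e.g. obtained
    with open(path, "rb").read().
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample bytes: one Latin-1 encoded string and one pure ASCII string.
latin1_bytes = "Ørsted".encode("latin-1")
ascii_bytes = b"plain ascii text"

print(guess_encoding(latin1_bytes))
print(guess_encoding(ascii_bytes))
```

Pure ASCII data is valid in both encodings, so the order of the candidates decides which name is reported.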

Once you think you know which encoding was used to write the files, you can specify it in the load_files() function:

tweets_train = load_files('path to tweets', encoding='latin-1')

...if you found out that Latin-1 is the encoding used for the tweets; otherwise adjust accordingly.
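To avoid having to guess at all for future downloads, one option is to name the encoding explicitly when writing the files. The sketch below uses a hypothetical helper, append_text, and a temporary file purely for demonstration; the choice of UTF-8 is an assumption, picked because it matches the default that load_files and CountVectorizer expect.

```python
import os
import tempfile


def append_text(path, text, encoding="utf-8"):
    """Append `text` to `path`, always using the given encoding.

    Hypothetical helper: passing `encoding` explicitly means the result
    no longer depends on the system's default encoding.
    """
    with open(path, "a", encoding=encoding) as handle:
        handle.write(text)


# Small demonstration in a temporary directory with made-up text.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "tweets.txt")
append_text(demo_path, "Ørsted\n")

# Read the raw bytes back to confirm they are valid UTF-8.
with open(demo_path, "rb") as handle:
    stored = handle.read()

print(stored.decode("utf-8"))
```

Files written this way can then be loaded without passing any encoding argument.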

Source

2017-05-26 13:28:15 lenz

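Thanks, I'll try your suggestions when I get home this afternoon. –

If it doesn't work, try the 'encoding=...' parameter of the 'CountVectorizer()' constructor instead of the 'load_files()' function. – lenz

Thank you! You pointed me in the right direction; I eventually found out that latin-1 was the encoding I needed (it was shown for f after open). –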
