UnicodeDecodeError: unexpected end of data

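I am new to Python and am trying to work on a small chunk of the Yelp! dataset. The dataset was JSON, but I converted it to CSV, and I am using the pandas library and NLTK.

While preprocessing the data, I first try to remove all punctuation as well as the most common stop words. After doing that, I want to apply the Porter stemming algorithm, which is available in nltk.stem.

Here is my code: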

import re

from nltk.corpus import stopwords


def stopWords(review):
    """Remove noise from the data and the most common stop words (NLTK)."""
    stopset = set(stopwords.words("english"))
    review = review.lower() 
    review = review.replace(".","") 
    review = review.replace("-"," ") 
    review = review.replace(")","") 
    review = review.replace("(","") 
    review = review.replace("i'm"," ") 
    review = review.replace("!","") 
    review = re.sub("[!@#*;:<+>~-]", '', review)
    row = review.split() 

    tokens = ' '.join([word for word in row if word not in stopset]) 
    return tokens

And here is my stemming method, to which I pass the tokens:

from nltk import stem


def stemWords(impWords):
    """Stem the words to their roots using the Porter algorithm (NLTK)."""
    stemmer = stem.PorterStemmer()
    tok = stopWords(impWords)
    # ========================================================================
    stemmed = " ".join([stemmer.stem(str(word)) for word in tok.split(" ")])
    # ========================================================================
    return stemmed

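But I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data. The line between the '==' markers is the one that raises it.

I have tried cleaning the data and removing all special characters (!@#$^&* and others) to make this work. The stop-word removal works fine, but the stemming does not. Can someone tell me what I am doing wrong?

If my data is not clean, or a Unicode string is breaking somewhere, is there any way I can clean or fix it so that it no longer raises this error? I want to do the stemming, so any suggestions would be helpful.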

Source

2015-05-17 Anshul Vyas

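Read up on Unicode string handling in Python. There is the type str, but there is also a type unicode.

I suggest that you:

decode each line immediately after reading it, to narrow down the incorrect characters in your input data (real data contains errors)
work with unicode and u" " strings everywhere.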
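
The first suggestion could look roughly like the sketch below. It is written for Python 3, where the text type is called str rather than unicode, and it assumes the input is UTF-8. The helper name decode_lines and the sample bytes are made up for illustration; they are not from the question.

```python
def decode_lines(raw_lines, encoding="utf-8"):
    """Decode byte lines one at a time and record the ones that fail.

    Returns (decoded, bad): decoded holds one text string per input line,
    bad holds (line_number, reason) for every line that was not valid.
    """
    decoded, bad = [], []
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError as exc:
            # Remember where the bad bytes were, then fall back to a lossy
            # decode so the rest of the pipeline still receives text.
            bad.append((lineno, exc.reason))
            decoded.append(raw.decode(encoding, errors="replace"))
    return decoded, bad


# Hypothetical sample: the second line ends in a lone 0xc2 byte, which is
# the first half of a two-byte UTF-8 sequence.
sample = [b"great caf\xc3\xa9", b"truncated \xc2"]
texts, problems = decode_lines(sample)
print(problems)
```

Decoding at the point of reading means a bad line is reported with its line number, instead of surfacing later inside the stemmer.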

Source

2015-05-17 08:15:48

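There is a simple way to filter out these annoying errors. You can preprocess each review with

review = review.encode('ascii', errors='ignore')

to remove all invalid characters. Judging by your code, ASCII characters are all you want.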
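
On Python 3, encode returns bytes, so a decode step is needed to get text back. A small sketch of that variant follows; the helper name to_ascii is hypothetical, and note that this approach silently drops accented letters as well as stray bytes.

```python
def to_ascii(review):
    """Drop every non-ASCII character from a text string."""
    # encode(...) yields bytes with the non-ASCII characters skipped;
    # decode(...) turns the remaining pure-ASCII bytes back into str.
    return review.encode("ascii", errors="ignore").decode("ascii")


cleaned = to_ascii("caf\u00e9 is great")
print(cleaned)
```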

Source

2016-03-08 19:38:26 Linjie
