Python: removing stop words from a pandas DataFrame gives wrong output

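I am removing stop words from multiple files. First, I read each file and remove the stop words from its DataFrame. After that, I concatenate that DataFrame with the next one. When I print the DataFrame, it gives me output like this: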

0  [I, , , , , r, e, , h, , h, , h, v, e, ...  
1  [D, , u, , e, v, e, n, , e, , h, e, , u, ...  
2  [R, g, h, , f, r, , h, e, , e, c, r, , w, ...  
3  [A, f, e, r, , c, l, l, n, g, , n, , p, l, ...  
4  [T, h, e, r, e, , v, e, r, e, e, n, , , n, ...

Here is my code:

import glob
import pandas as pd
from nltk.corpus import stopwords

allFiles = glob.glob(ROOT_DIR + '/' + DATASET + "/*.csv")
frame = pd.DataFrame()
list_ = []
stop = stopwords.words('english')
for file_ in allFiles:
    chunkDataframe = pd.read_csv(file_,index_col=None, header=0, chunksize=1000)
    dataframe = pd.concat(chunkDataframe, ignore_index=True)
    dataframe['Text'] = dataframe['Text'].apply(lambda x: [item for item in x if item not in stop])
    print dataframe
    list_.append(dataframe)
frame = pd.concat(list_)

Please help me optimize the way I read multiple files and remove stop words from them.

Source

2017-04-03 lucy

Can you provide an [MCVE]? – IanS

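dataframe['Text'] contains a single string per row, not a list of words. So when lambda x: [item for item in x if item not in stop] iterates over it, it walks through the string character by character and produces a list of characters as the result. To process it word by word instead, change it to: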

lambda x: [item for item in x.split() if item not in stop]
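To make the difference concrete, here is a minimal sketch that contrasts the two ways of iterating. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names introduced for illustration only; the question uses nltk's `stopwords.words('english')` instead. Lowercasing each word before the lookup is also an assumption added here, on the grounds that stop-word lists are usually lowercase.

```python
# Small illustrative stop-word set (an assumption, standing in for
# nltk's stopwords.words('english') from the question).
STOP_WORDS = {"i", "the", "a", "to", "have"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` that are not stop words."""
    # str.split() with no arguments splits on any run of whitespace.
    # Lowercasing before the lookup is an added assumption.
    return [word for word in text.split() if word.lower() not in stop_words]


text = "I have to read the file"

# Iterating over the string directly yields single characters,
# which is the pattern seen in the question's output.
by_character = [item for item in text if item not in STOP_WORDS]

# Splitting first yields whole words.
by_word = remove_stop_words(text)

print(by_character[:5])
print(by_word)
```

With a DataFrame like the one in the question, the helper could presumably be applied as `dataframe['Text'].apply(remove_stop_words)`. Keeping the stop words in a `set` rather than a list also makes each membership test faster, which may help when processing many files.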

Source

2017-04-03 13:25:20 acidtobi

In a case like this, how do I get the result to display correctly? When I try this, I get: > – mkheifetz
