2013-11-15 31 views

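NLTK stemmer returns a list of NoneTypes

I have preprocessed some text data, but whenever I apply the NLTK stemmer I get a list of NoneTypes. I can't figure out why this happens, and I don't know how to fix it.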

This is what my text data looks like at each stage of processing:

After tokenizing and cleaning the text:

In [12]:
train = loop_data(train)
train
Out[12]: 
array(['Jazz for a Rainy Afternoon : { link }', 
     'RT : @ mention : I love rainy days .', 
     'Good Morning Chicago ! Time to kick the Windy City in the nuts and head back West !', 
     ..., 
     'OMG # WeatherForecast for tomm 80 degrees & Sunny & lt ; === # NeedThat # Philly # iMustSeeItToBelieveIt yo', 
     "@ mention Oh no ! We had cold weather early in the week , but now it 's getting warmer ! Hoping the rain holds out to Saturday !", 
     'North Cascades Hwy to reopen Wed. : quite late after a long , deep winter. Only had to clear snow 75 ft deep { link }'], dtype=object) 

Before processing:

In [10]:
import pandas as pd 
import numpy as np 
import glob 
import os 
import nltk 
dir = r"C:\Users\Anonymous\Desktop\KAGA FOLDER\Hashtags"  # raw string so the backslashes are not read as escapes
train = np.array(pd.read_csv(os.path.join(dir,"train.csv")))[:,1] 
def clean_the_text(data): 
    alist = [] 
    data = nltk.word_tokenize(data) 
    for j in data: 
        alist.append(j.rstrip('\n'))
    alist = " ".join(alist) 
    return alist 

def stemmer(data): 
    stemmer = nltk.stem.PorterStemmer() 
    new_list = [] 
    new_list = [new_list.append(stemmer.stem(word)) for word in data] 
    return new_list 
def loop_data(data): 
    for i in range(len(data)): 
        data[i] = clean_the_text(data[i])
    return data 
train 


Out[10]: 
array(['Jazz for a Rainy Afternoon: {link}', 
     'RT: @mention: I love rainy days.', 
     'Good Morning Chicago! Time to kick the Windy City in the nuts and head back West!', 
     ..., 
     'OMG #WeatherForecast for tomm 80 degrees & Sunny <=== #NeedThat #Philly #iMustSeeItToBelieveIt yo', 
     "@mention Oh no! We had cold weather early in the week, but now it's getting warmer! Hoping the rain holds out to Saturday!", 
     'North Cascades Hwy to reopen Wed.: quite late after a long, deep winter. Only had to clear snow 75 ft deep {link}'], dtype=object) 

Finally, after stemming:

In [13]:
train = stemmer(train) 
train 
Out[13]: 
[None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 
None, 

Answer


The problem is here: new_list = [new_list.append(stemmer.stem(word)) for word in data]. It should be:

new_list = [stemmer.stem(word) for word in data] 
# or 
# new_data = map(stemmer.stem, data) # returns a map object 

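The original new_list is appended to len(data) times, and then the name new_list is rebound to the new list produced by the list comprehension. That new list holds the len(data) return values of new_list.append, and list.append always returns None.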
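
The behaviour described above can be reproduced without NLTK. The following is only a minimal sketch: fake_stem is a hypothetical stand-in for stemmer.stem (it is not a real stemmer), and the sample words are made up for illustration. Separate names are used for the two lists so that both the discarded side-effect list and the list of None values remain visible.

```python
def fake_stem(word):
    # Hypothetical stand-in for stemmer.stem, so the sketch runs without NLTK.
    return word.lower().rstrip("s")

data = ["Days", "rains", "Jazz"]  # made-up sample input

# Buggy pattern from the question: list.append mutates the list in place
# and returns None, so the comprehension collects only None values.
side_effect_list = []
buggy = [side_effect_list.append(fake_stem(w)) for w in data]

# Corrected pattern: the comprehension collects the return values directly.
fixed = [fake_stem(w) for w in data]

# map is lazy in Python 3, so wrap it in list() to get an actual list.
mapped = list(map(fake_stem, data))

print(buggy)             # the list of None values seen in the question
print(side_effect_list)  # the stems did get appended here, as a side effect
print(fixed)
```

In the question's code both lists share the name new_list, so the list that actually received the stems is thrown away as soon as the comprehension result is assigned.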


Thank you very much. That was indeed the problem. – Learner