2013-11-22 351 views
4

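I am extracting features from a text corpus, using a TF-IDF vectorizer and truncated singular value decomposition from scikit-learn. Since the algorithm I want to try requires dense matrices and the vectorizer returns sparse matrices, I need to convert those matrices to dense arrays. But whenever I try to convert them, I get an error telling me that my numpy array object has no attribute "toarray": AttributeError: 'numpy.ndarray' object has no attribute 'toarray'. What am I doing wrong?

The function: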

def feature_extraction(train,train_test,test_set): 
    vectorizer = TfidfVectorizer(min_df = 3,strip_accents = "unicode",analyzer = "word",token_pattern = r'\w{1,}',ngram_range = (1,2))   

    print("fitting Vectorizer") 
    vectorizer.fit(train) 

    print("transforming text") 
    train = vectorizer.transform(train) 
    train_test = vectorizer.transform(train_test) 
    test_set = vectorizer.transform(test_set) 

    print("Dimensionality reduction") 
    svd = TruncatedSVD(n_components = 100) 
    svd.fit(train) 
    train = svd.transform(train) 
    train_test = svd.transform(train_test) 
    test_set = svd.transform(test_set) 

    print("convert to dense array") 
    train = train.toarray() 
    test_set = test_set.toarray() 
    train_test = train_test.toarray() 

    print(train.shape) 
    return train,train_test,test_set 

Traceback:

Traceback (most recent call last): 
    File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 24, in <module> 
    x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set) 
    File "C:\Users\Anonymous\workspace\final_submission\src\Preprocessing.py", line 57, in feature_extraction 
    train = train.toarray() 
AttributeError: 'numpy.ndarray' object has no attribute 'toarray' 

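Update: willy pointed out that my assumption that the matrix is sparse might be wrong. So I tried feeding my data to my algorithm with dimensionality reduction, and it actually worked without any conversion. However, when I exclude dimensionality reduction, which leaves me with around 53k features, I get the following error: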

Traceback (most recent call last): 
    File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 28, in <module> 
    result = bayesian_ridge(x_train,x_test,y_train,y_test,test_set) 
    File "C:\Users\Anonymous\workspace\final_submission\src\Algorithms.py", line 84, in bayesian_ridge 
    algo = algo.fit(x_train,y_train[:,i]) 
    File "C:\Python27\lib\site-packages\sklearn\linear_model\bayes.py", line 136, in fit 
    dtype=np.float) 
    File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 220, in check_arrays 
    raise TypeError('A sparse matrix was passed, but dense ' 
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array. 

Can someone explain this?

UPDATE2

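As requested, I'll give all the code that is involved. Since it is scattered over different files, I'll post it in steps. For clarity, I'll leave out all the module imports.

This is my preprocessing code: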

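def regexp(data): 
    for row in range(len(data)): 
        # Replace runs of non-word characters and underscores with one space.
        data[row] = re.sub(r'[\W_]+', " ", data[row]) 
    # Return after the loop so that every row is processed, not just the first.
    return data 

def clean_the_text(data): 
    # Tokenize, lowercase each token, and join the tokens back with spaces.
    alist = [] 
    data = nltk.word_tokenize(data) 
    for j in data: 
        j = j.lower() 
        alist.append(j.rstrip('\n')) 
    alist = " ".join(alist) 
    return alist 

def loop_data(data): 
    # Clean every document in place.
    for i in range(len(data)): 
        data[i] = clean_the_text(data[i]) 
    return data 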


if __name__ == "__main__": 
    print("loading train") 
    train_text = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,1])))) 
    print("loading test_set") 
    test_set = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"test.csv")))[:,1])))) 

After splitting my train_set into an x_train and an x_test for cross-validation, I transform my data using the feature_extraction function above.

x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set) 

Finally, I feed them into my algorithm:

def bayesian_ridge(x_train,x_test,y_train,y_test,test_set): 
    algo = linear_model.BayesianRidge() 
    algo = algo.fit(x_train,y_train) 
    pred = algo.predict(x_test) 
    error = pred - y_test 
    result.append(algo.predict(test_set)) 
    print("Bayes_error: ",cross_val(error)) 
    return result 
+4

If `train` is already an ndarray, then your assumption that it returns a sparse matrix is incorrect. – willy

+0

You might be right, let me check. – Learner

+0

Checked it. About to add an edit to my question now. – Learner

Answer

1

TruncatedSVD.transform returns an array, not a sparse matrix. In fact, in the current version of scikit-learn, only the vectorizers return sparse matrices.
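
To see the difference for yourself, a minimal sketch along these lines may help. The four-document corpus and the `n_components=2` setting are made up for illustration and are not taken from the question; the point is only to compare the types returned by the two steps.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, only for illustration (not the asker's data).
docs = [
    "sparse matrices save memory",
    "dense arrays are plain numpy arrays",
    "truncated svd reduces the dimensionality",
    "tfidf turns text into sparse features",
]

# The vectorizer returns a scipy.sparse matrix.
tfidf = TfidfVectorizer().fit_transform(docs)
print(type(tfidf), sparse.issparse(tfidf))

# TruncatedSVD accepts the sparse input and returns a dense numpy.ndarray.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)
print(type(reduced), reduced.shape)

# An ndarray has no .toarray() method, hence the AttributeError.
print(hasattr(reduced, "toarray"))
```

So after the SVD step the data is already dense, and the three `.toarray()` calls at the end of `feature_extraction` can simply be dropped.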

+0

@Learner: It's in the [docstring](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD.transform) for that method. –
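
The second traceback in the question is the opposite case: without the SVD step the features are still the vectorizer's sparse output, and the estimator in that scikit-learn version asks for dense data. One way to handle both cases is a small helper that converts only when needed. This is a sketch; `ensure_dense` is a hypothetical name, not a scikit-learn function.

```python
import numpy as np
from scipy import sparse


def ensure_dense(X):
    """Return X as a dense numpy array, converting only if it is sparse."""
    if sparse.issparse(X):
        return X.toarray()
    # Already dense (for example the output of TruncatedSVD.transform).
    return np.asarray(X)


# Small made-up inputs to show both code paths.
sparse_X = sparse.csr_matrix([[0.0, 1.0], [2.0, 0.0]])
dense_X = np.array([[0.0, 1.0], [2.0, 0.0]])

print(type(ensure_dense(sparse_X)))
print(type(ensure_dense(dense_X)))
```

Keep in mind that densifying a matrix with around 53k columns can take a lot of memory, so reducing the dimensionality first is usually the more practical route.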
