Text classification with scikit-learn

I am doing text classification with two labels using scikit-learn. I load my text files with the load_files method:

categories={'label0','label1'} 
text_data = load_files(path,categories=categories)

from the following directory structure:

train 
├── Label0 
│ ├── 0001.txt 
│ └── 0002.txt 
└── Label1 
    ├── 0001.txt 
    └── 0002.txt

My problem is that when I try to check the shape of text_data.data, I get:

print (type(text_data.data)) 
<type 'list'> 

print text_data.data.shape 
AttributeError: 'list' object has no attribute 'shape' 

X = np.array(text_data.data) 
print X.shape 
(35,)

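It returns a one-dimensional array. I expected a two-dimensional NumPy array or a dictionary, with one part holding the texts and the other holding the class (label 0 or 1). What am I missing?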

2016-02-06 Ophilia

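I edited the question. My problem is that the returned list is one-dimensional and only the texts are stored in it. Shouldn't the returned list contain the class labels as well as the texts? – Ophilia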

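Once you have your data, don't forget to shuffle it and to create a validation set. (To be as strict as possible, you should shuffle and split before creating the text features (as suggested by David Maust).) – user1269942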
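The ordering recommended in the comment above (shuffle and split first, build text features second) could look roughly like the sketch below. The toy `texts` and `labels` lists are hypothetical stand-ins for `text_data.data` and `text_data.target`, and `train_test_split` with `stratify` is one possible way to do the split, not something the thread prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for text_data.data and text_data.target.
texts = ["good movie", "bad movie", "great film", "awful film",
         "good film", "bad plot", "great plot", "awful movie"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Shuffle and split the raw texts first, keeping the class proportions.
X_train_txt, X_val_txt, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, shuffle=True, stratify=labels, random_state=0)

cv = CountVectorizer()
X_train = cv.fit_transform(X_train_txt)  # vocabulary is learned from training texts only
X_val = cv.transform(X_val_txt)          # validation texts reuse that vocabulary

print(X_train.shape, X_val.shape)
```

Fitting the vectorizer on the training split only keeps words that appear solely in the validation texts out of the vocabulary, which is the point of splitting before feature creation.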

The problem is that after calling load_files, the data is not yet a NumPy array. It is just a list of texts. You should vectorize that text with CountVectorizer or TfidfVectorizer.

Example:

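from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

cv = CountVectorizer()
X = cv.fit_transform(text_data.data)  # sparse document-term matrix, one row per document
y = text_data.target  # integer class index for each document
print(cv.vocabulary_)  # Show words in vocabulary with column index

clf = LogisticRegression()  # or other classifier
clf.fit(X, y)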

2016-02-07 01:15:49

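Thanks. But how is each text linked to its label? Is it through the index of the two lists (text_data.data and text_data.target)? How can I access the sparse matrix that gets created? I want to see how it is built. And can I access the matrix when CountVectorizer is part of a pipeline? – Ophilia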

Yes, exactly. The row index of 'X' corresponds to the index of 'text_data.target'. –
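To make the index correspondence concrete, here is a small sketch that builds a throwaway corpus in a temporary directory (the folder names and texts are made up for illustration) and pairs each text with its label name. It assumes `load_files` is called with `shuffle=False` so the order is easy to follow; with the default shuffling the pairing by index still holds.

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny throwaway corpus that mirrors the train/<label>/<file> layout.
root = tempfile.mkdtemp()
samples = {"label0": ["first text", "second text"], "label1": ["third text"]}
for label, docs in samples.items():
    os.makedirs(os.path.join(root, label))
    for i, doc in enumerate(docs):
        path = os.path.join(root, label, "%04d.txt" % (i + 1))
        with open(path, "w", encoding="utf-8") as f:
            f.write(doc)

text_data = load_files(root, encoding="utf-8", shuffle=False)

# data[i] is the text, target[i] is its integer class,
# and target_names maps that integer back to the folder name.
pairs = [(text, text_data.target_names[t])
         for text, t in zip(text_data.data, text_data.target)]
for text, name in pairs:
    print(name, "->", text)
```

So the labels are not missing: they live in `text_data.target`, parallel to `text_data.data`, rather than in a second column of the same array.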

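I have updated my example to show how to work with the sparse matrix. To look at it directly, you can try 'X[:5, :].todense()' to view the first 5 rows as a dense matrix. You can use 'CountVectorizer' in a pipeline, but its output is passed straight to the next step. –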
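A minimal sketch of both points, using made-up toy texts: inspecting the sparse matrix directly, and recovering it from a fitted pipeline. The step names "vect" and "clf" are arbitrary choices for this illustration, and re-running the fitted vectorizer through `named_steps` is just one way to see what the classifier received.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for text_data.data / text_data.target.
texts = ["good movie", "bad movie", "good film", "bad film"]
labels = [1, 0, 1, 0]

cv = CountVectorizer()
X = cv.fit_transform(texts)
print(type(X).__name__, X.shape)  # SciPy sparse matrix, one row per document

# Words ordered by their column index in X.
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(columns)
print(X[:2, :].toarray())  # first two rows as a dense array

# Inside a pipeline the matrix is handed straight to the classifier,
# but the fitted vectorizer is still reachable and can be re-applied.
pipe = Pipeline([("vect", CountVectorizer()), ("clf", LogisticRegression())])
pipe.fit(texts, labels)
X_again = pipe.named_steps["vect"].transform(texts)
print(X_again.shape)
```

Each row of the dense view is one document and each column is the count of one vocabulary word, which is why the matrix is two-dimensional while the labels stay in a separate one-dimensional array.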
