How do I import data into SKLearn's ndarray format?

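Scikit-Learn is a great Python module that provides many algorithms, including support vector machines. I have spent the past few days learning how to use the module, and I have noticed that it relies heavily on the separate numpy module.

I understand what the module does, but I am still learning how it works. Here is a very simple example of what I use sklearn for: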

from sklearn import datasets, svm
import numpy

digits = datasets.load_digits()  # 8x8 pixel images of the digits 0-9, plus the digit label for each image

clf = svm.SVC(gamma=0.001, C=100)  # SVC is the algorithm used for classifying this type of data

x, y = digits.data[:-1], digits.target[:-1]  # feed it all the data except the last sample
clf.fit(x, y)  # "train" the SVM

# predict() expects a 2D array (samples x features), so slice instead of indexing
print(clf.predict(digits.data[:1]))  # >>> [0]
# With all of the data (1797 samples), accuracy is about 99%.
# If this number gets smaller, the accuracy decreases. With 10 samples (0-9),
# accuracy can still be as high as 90%.

This is very basic classification. There are 10 classes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Running the following code with matplotlib.pyplot:

import matplotlib.pyplot as plt #in shell after running previous code 
plt.imshow(digits.images[0],cmap=plt.cm.gray_r,interpolation="nearest") 
plt.show()

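gives the following image:

The first pixel (read left to right, top to bottom, like text) is represented by 0, as is the second, while the third is 5 (values range from 0 to 16) and the fourth is 13. Here is the actual data for the image: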

[[ 0. 0. 5. 13. 9. 1. 0. 0.] 
[ 0. 0. 13. 15. 10. 15. 5. 0.] 
[ 0. 3. 15. 2. 0. 11. 8. 0.] 
[ 0. 4. 12. 0. 0. 8. 8. 0.] 
[ 0. 5. 8. 0. 0. 9. 8. 0.] 
[ 0. 4. 11. 0. 1. 12. 7. 0.] 
[ 0. 2. 14. 5. 10. 12. 0. 0.] 
[ 0. 0. 6. 13. 10. 0. 0. 0.]]

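So my question is this: if I wanted to classify text data, for example forum posts that are in the wrong subforum/category, how would I convert that data into the numeric system used in the data set example?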

2016-07-09 JoshuaS3

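You need to flatten it into a single vector. So your numpy array will be n x 64, where n is the number of images and each column represents one pixel of the image. Obviously, a lot of interesting information about the image is lost in this representation, which is one of the reasons convolutional neural networks are generally far better at image classification. – David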
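The flattening described in this comment can be sketched with NumPy's reshape. This is only a minimal illustration: the `images` array below is made-up stand-in data (an assumption, not the digits set), shaped as a stack of n grayscale 8x8 images.

```python
import numpy as np

# Hypothetical stand-in data: 3 grayscale images of 8x8 pixels each.
images = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)

# Flatten each image row by row (left to right, top to bottom) into one vector.
# -1 lets NumPy work out the 64 columns from the remaining dimensions.
X = images.reshape(len(images), -1)

print(X.shape)  # (3, 64): one row per image, one column per pixel
```

For the digits data set in the question, digits.data already holds this flattened form of digits.images, so no extra step is needed there.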

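For each sample (for example, each forum post) you need a vector (a list, in Python). For example, if you have 200 posts and their respective categories, you need 200 lists of training data, plus exactly one list of 200 elements holding the category of each post. Each training list can be built with a model such as bag of words (see: https://en.wikipedia.org/wiki/Bag-of-words_model). Note that all training lists must have the same number of elements (the same dimension); for example, each list might have 3000 elements, with each element representing the presence or absence of one word. Try this tutorial, which is easy for beginners: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words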
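One possible way to build such fixed-length presence/absence vectors is scikit-learn's CountVectorizer with binary=True. The sketch below is an illustration under assumptions, not the answerer's own code: the posts, the category names, and the choice of a linear-kernel SVC are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy data: four posts and the subforum each one belongs in.
posts = [
    "my python script raises an import error",
    "how do I install numpy with pip",
    "best settings for landscape photography",
    "which camera lens should I buy",
]
categories = ["programming", "programming", "photography", "photography"]

# binary=True records presence or absence of each word rather than its count.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(posts)  # sparse matrix, one row per post

# Every row has the same length: one column per word in the learned vocabulary.
print(X.shape)

clf = SVC(kernel="linear")  # assumed classifier choice for this sketch
clf.fit(X, categories)

# New text must go through the same fitted vectorizer so the columns match.
new_post = ["pip cannot install this python package"]
prediction = clf.predict(vectorizer.transform(new_post))
print(prediction)
```

Words the vectorizer never saw during fitting (here "cannot", "this", "package") are simply ignored at prediction time, which is why a realistic training set needs far more than four posts.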

2016-07-09 15:25:16 Masoud
