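Passing categorical data to Sklearn Decision Tree
There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says the following: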
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data. Is it possible with Sklearn?
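-1 This is misleading. As it stands, sklearn decision trees do not handle categorical data - [see issue #5442](https://github.com/scikit-learn/scikit-learn/issues/5442). This approach of using label encoding converts the categories to integers, which `DecisionTreeClassifier()` **will treat as numeric**. If your categorical data is not ordinal, this is not good: you will end up with splits that do not make sense. Using a `OneHotEncoder` is currently the only valid way, but it is computationally expensive. – kungfujam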
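A minimal sketch of the one-hot approach mentioned in the comment above, applied to the question's dataframe. It uses `pd.get_dummies` as a stand-in for `OneHotEncoder` (my choice for brevity, not something the comment prescribes), and `random_state=0` is only there to make the run repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same data as in the question.
data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# One indicator column per category of A and B; the numeric column C is left as-is.
X = pd.get_dummies(data[['A', 'B', 'C']], columns=['A', 'B'])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, data['Class'])      # no "could not convert string to float" error
predictions = tree.predict(X)   # predictions on the training rows
```

Each category gets its own 0/1 column, so the tree never compares categories as if they were ordered numbers. The cost is a wider feature matrix when a column has many distinct values.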
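@Abhinav, is it possible to apply `LabelEncoder` to multiple columns of a dataframe at once? For example, in the question's dataframe, can we do something like `le.fit_transform(data[['A','B','C']])` to get labels for all categorical columns in one go? Or do the categorical columns have to be specified explicitly and converted one by one? – Minu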
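As far as I know, `LabelEncoder` expects a single 1-D column, so it has to be applied per column rather than to the whole dataframe. A rough sketch, assuming the categorical columns are `A` and `B` and keeping one encoder per column in a dictionary (the names `encoders` and `encoded` are just illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
})

categorical_columns = ['A', 'B']   # assumed; list the categorical columns explicitly
encoders = {}                      # one fitted encoder per column, kept for decoding
encoded = data.copy()
for col in categorical_columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(data[col])

# Map the integers of column A back to the original labels.
decoded_A = encoders['A'].inverse_transform(encoded['A'])
```

Note that the resulting integers are still treated as ordered numbers by the tree, so this only makes sense for ordinal categories.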
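@kungfujam, also, I want to one-hot encode the categorical columns once I have applied `LabelEncoder` to them, to address the issue pointed out by @kungfujam. How do I do that after the label encoding is done? – Minu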
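One possible way, assuming scikit-learn 0.20 or later: `OneHotEncoder` there accepts string columns directly, so the label-encoding step can be skipped altogether. The sketch below wraps it in a `ColumnTransformer` so that only `A` and `B` are encoded and `C` passes through unchanged; the step name `'onehot'` and `random_state=0` are arbitrary choices of mine.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# One-hot encode the string columns; pass the numeric column through untouched.
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])],
    remainder='passthrough',
)

model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(data[['A', 'B', 'C']], data['Class'])
predictions = model.predict(data[['A', 'B', 'C']])
```

Keeping the encoder inside the pipeline means the same encoding is reapplied at prediction time, and `handle_unknown='ignore'` encodes categories not seen during fitting as all zeros instead of raising an error.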