Can sklearn's LabelBinarizer act similarly to DictVectorizer?

I have a dataset that includes both numeric and categorical features, where a categorical feature can contain a list of labels. For example:

RecipeId  Ingredients   TimeToPrep
1         Flour, Milk   20
2         Milk          5
3         Unobtainium   100

If I had only one ingredient per recipe, DictVectorizer would have elegantly handled the encoding into appropriate dummy variables:

from sklearn.feature_extraction import DictVectorizer
RecipeData=[{'RecipeID':1,'Ingredients':'Flour','TimeToPrep':20}, {'RecipeID':2,'Ingredients':'Milk','TimeToPrep':5}
,{'RecipeID':3,'Ingredients':'Unobtainium','TimeToPrep':100}]
dc=DictVectorizer()
dc.fit_transform(RecipeData).toarray()

This gives as output:

array([[ 1., 0., 0., 1., 20.], 
     [ 0., 1., 0., 2., 5.], 
     [ 0., 0., 1., 3., 100.]])

The integer features are handled correctly, while the categorical labels are encoded as boolean features.

However, DictVectorizer does not handle list-valued features and chokes on

RecipeData=[{'RecipeID':1,'Ingredients':['Flour','Milk'],'TimeToPrep':20}, {'RecipeID':2,'Ingredients':'Milk','TimeToPrep':5} ,{'RecipeID':3,'Ingredients':'Unobtainium','TimeToPrep':100}]

LabelBinarizer handles this correctly, but the categorical variable must be extracted and processed separately:

from sklearn.preprocessing import LabelBinarizer 
lb=LabelBinarizer() 
lb.fit_transform([('Flour','Milk'), ('Milk',), ('Unobtainium',)]) 
array([[1, 1, 0], 
     [0, 1, 0], 
     [0, 0, 1]])

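This is what I currently do: extract the categorical features containing lists of labels from the mixed numeric/categorical input array, transform them with LabelBinarizer, then glue the numeric features back on.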
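For readers on a recent scikit-learn, the extract / binarize / re-attach workflow described above might look roughly like the sketch below. It assumes `MultiLabelBinarizer` rather than `LabelBinarizer`, since to my knowledge newer releases no longer accept sequences of labels in `LabelBinarizer`. It also assumes every `Ingredients` value has already been normalized to a list. The variable names (`recipes`, `ingredient_matrix`, `combined`) are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: each 'Ingredients' value is already a list, even for one item.
recipes = [
    {"RecipeID": 1, "Ingredients": ["Flour", "Milk"], "TimeToPrep": 20},
    {"RecipeID": 2, "Ingredients": ["Milk"], "TimeToPrep": 5},
    {"RecipeID": 3, "Ingredients": ["Unobtainium"], "TimeToPrep": 100},
]

# 1. Extract the list-valued categorical feature and the numeric feature.
ingredient_lists = [r["Ingredients"] for r in recipes]
numeric = np.array([[r["TimeToPrep"]] for r in recipes])

# 2. Encode each list of labels as one row of 0/1 indicator columns.
mlb = MultiLabelBinarizer()
ingredient_matrix = mlb.fit_transform(ingredient_lists)

# 3. Glue the numeric column back on to the right of the indicator columns.
combined = np.hstack([ingredient_matrix, numeric])
feature_names = list(mlb.classes_) + ["TimeToPrep"]

print(feature_names)
print(combined)
```

Keeping `feature_names` alongside `combined` makes it easier to tell which column is which after the two pieces are stacked.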

Is there a better way to do this?

Source

2014-01-15 Noam Kremen

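LabelBinarizer is intended for class labels, not features (although with the right massaging it will handle categorical features as well).

The intended use of DictVectorizer is that you map a data-specific function over your samples to extract useful features, where the function returns a dict. So the elegant way to solve this is to write a function that flattens your feature dicts and replaces each list with individual features whose value is True: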

>>> def flatten_ingredients(d):
...     # in-place version
...     if isinstance(d.get('Ingredients'), list):
...         for ingredient in d.pop('Ingredients'):
...             d['Ingredients=%s' % ingredient] = True
...     return d
...
>>> RecipeData=[{'RecipeID':1,'Ingredients':['Flour','Milk'],'TimeToPrep':20}, {'RecipeID':2,'Ingredients':'Milk','TimeToPrep':5} ,{'RecipeID':3,'Ingredients':'Unobtainium','TimeToPrep':100}] 
>>> list(map(flatten_ingredients, RecipeData))  # list() forces evaluation on Python 3
[{'RecipeID': 1, 'TimeToPrep': 20, 'Ingredients=Flour': True, 'Ingredients=Milk': True}, {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5}, {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]
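Note that the helper above modifies each dict in place, so the original `RecipeData` is changed after the first pass. If that is undesirable, a non-mutating variant could look something like the sketch below. The name `flatten_ingredients_copy` and its `key` parameter are hypothetical, made up for illustration only.

```python
def flatten_ingredients_copy(d, key="Ingredients"):
    """Return a new dict where a list-valued `key` is replaced by one
    'key=item': True entry per item. The input dict is not modified."""
    # Copy every entry except the one being flattened.
    flat = {k: v for k, v in d.items() if k != key}
    value = d.get(key)
    if isinstance(value, (list, tuple, set)):
        for item in value:
            flat["%s=%s" % (key, item)] = True
    elif value is not None:
        # A plain string is kept as-is; DictVectorizer one-hot encodes it.
        flat[key] = value
    return flat


recipe_data = [
    {"RecipeID": 1, "Ingredients": ["Flour", "Milk"], "TimeToPrep": 20},
    {"RecipeID": 2, "Ingredients": "Milk", "TimeToPrep": 5},
    {"RecipeID": 3, "Ingredients": "Unobtainium", "TimeToPrep": 100},
]

flat_recipes = [flatten_ingredients_copy(d) for d in recipe_data]
print(flat_recipes[0])
```

The resulting `flat_recipes` list can be passed to `DictVectorizer.fit_transform` in the same way as the in-place version, while `recipe_data` keeps its original list values.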

In action:

>>> from sklearn.feature_extraction import DictVectorizer 
>>> dv = DictVectorizer() 
>>> dv.fit_transform(flatten_ingredients(d) for d in RecipeData).toarray() 
array([[ 1., 1., 0., 1., 20.], 
     [ 0., 1., 0., 2., 5.], 
     [ 0., 0., 1., 3., 100.]]) 
>>> dv.feature_names_ 
['Ingredients=Flour', 'Ingredients=Milk', 'Ingredients=Unobtainium', 'RecipeID', 'TimeToPrep']

(If I were you, I'd also remove the RecipeID, since it is unlikely to be a useful feature and it can easily lead to overfitting.)
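One way to drop the ID from the features while still being able to trace rows back to recipes is to split it off before vectorizing. A minimal sketch, assuming a hypothetical helper named `split_ids` (not part of scikit-learn):

```python
def split_ids(records, id_key="RecipeID"):
    """Separate the ID column from the feature dicts.

    Returns (ids, features); ids[i] identifies the sample in features[i],
    and therefore row i of any matrix built from `features` in order."""
    ids, features = [], []
    for record in records:
        ids.append(record.get(id_key))
        # Copy everything except the ID so the input is left untouched.
        features.append({k: v for k, v in record.items() if k != id_key})
    return ids, features


recipe_data = [
    {"RecipeID": 1, "Ingredients": "Flour", "TimeToPrep": 20},
    {"RecipeID": 2, "Ingredients": "Milk", "TimeToPrep": 5},
    {"RecipeID": 3, "Ingredients": "Unobtainium", "TimeToPrep": 100},
]

ids, features = split_ids(recipe_data)
print(ids)
print(features)
```

Since the order is preserved, `ids` can be kept next to the transformed array purely for debugging, without the ID ever being seen by the learner.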

Source

2014-01-15 12:04:01

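Thanks, this is more elegant than what I had cobbled together. The IDs are naturally not used in the learning task; they just help me identify the relevant rows in the transformed array when debugging.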
