Python - Feature hashing a list of strings within a list of strings

I would like to take a list of dictionaries (records) in which some columns hold a list of values as the cell value. Here is an example:

[{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]

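How can I take this input and perform feature hashing on it? (My dataset has thousands of columns.) Currently I am using one-hot encoding, but this seems to consume a lot of memory (more than my system has).

I tried passing my dataset in the form shown above and got an error: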

x__ = h.transform(data)

Traceback (most recent call last):
  File "<ipython-input-14-db4adc5ec623>", line 1, in <module>
    x__ = h.transform(data)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "sklearn/feature_extraction/_hashing.pyx", line 52, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:2103)
TypeError: a float is required

I also tried turning it into a DataFrame and passing that to the hasher:

x__ = h.transform(x_y_dataframe)

Traceback (most recent call last):
  File "<ipython-input-15-109e7f8018f3>", line 1, in <module>
    x__ = h.transform(x_y_dataframe)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "sklearn/feature_extraction/_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1928)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 138, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()
AttributeError: 'unicode' object has no attribute 'items'

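Any ideas how I can achieve this with pandas or sklearn? Or could I perhaps build my dummy variables a few thousand rows at a time?

Here is how I get my dummy variables using pandas: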

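import pandas

def one_hot_encode(categorical_labels):
    # `x` is the DataFrame holding the data (defined elsewhere)
    res = []
    for col in categorical_labels:
        # stringify each cell, strip the list brackets, split on ', ' into dummy columns
        v = x[col].astype(str).str.strip('[]').str.get_dummies(', ')  # can't set a prefix
        if len(res) == 2:
            # merge what has accumulated so far to keep the list short
            res = [pandas.concat(res, axis=1)]
        # always keep the current column's dummies (they were dropped in the merge branch before)
        res.append(v)
    return pandas.concat(res, axis=1)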
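
On the feature-hashing route itself: the first traceback appears to come from the hasher receiving a list as a value, and the second from iterating over a DataFrame, which yields column names rather than records. One possible workaround is to flatten each record into (feature name, value) pairs before hashing. The sketch below is a minimal illustration under assumptions: Python 3, a scikit-learn version whose FeatureHasher accepts input_type="pair", and a helper name (to_pairs) and n_features value that are made up for the example.

```python
from sklearn.feature_extraction import FeatureHasher

def to_pairs(record):
    """Hypothetical helper: flatten one record into (feature_name, value) pairs."""
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            # one indicator feature per list item, e.g. "fruit=banana"
            for item in value:
                yield ("%s=%s" % (key, item), 1.0)
        elif isinstance(value, str):
            # a single categorical value becomes one indicator feature
            yield ("%s=%s" % (key, value), 1.0)
        else:
            # numeric values keep their column name and magnitude
            yield (key, float(value))

data = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]

# n_features is an arbitrary choice for this sketch; larger values mean fewer collisions
hasher = FeatureHasher(n_features=2**10, input_type="pair")
X_hashed = hasher.transform([list(to_pairs(rec)) for rec in data])

print(X_hashed.shape)  # a scipy sparse matrix with one row per record
```

The result stays sparse, so memory grows with the number of non-zero entries rather than with the number of distinct categories. Note that with the default alternate_sign=True some hashed values may come out negative.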


2017-05-09 Kevin

You can convert the lists to tuples, which are hashable. – IanS
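
A minimal sketch of that suggestion, assuming the data sits in a pandas DataFrame with the column name from the question's example. It only makes the cells hashable in the Python sense; it does not by itself produce hashed features.

```python
import pandas as pd

df = pd.DataFrame([{'fruit': 'apple', 'age': 27},
                   {'fruit': ['apple', 'banana'], 'age': 32}])

# lists are unhashable; tuples with the same items are hashable
hashable = df['fruit'].map(lambda v: tuple(v) if isinstance(v, list) else v)

for cell in hashable:
    hash(cell)  # would raise TypeError for a list, succeeds for str and tuple
```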

Consider the following approach:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

vect = CountVectorizer()
# join list cells into one space-separated string so that each item becomes a token
X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x))

# on scikit-learn >= 1.0 use vect.get_feature_names_out() instead
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names(), index=df.index)
df.join(r)

Result:

In [66]: r
Out[66]:
   apple  banana
0      1       0
1      1       1

In [67]: df.join(r)
Out[67]:
   age            fruit  apple  banana
0   27            apple      1       0
1   32  [apple, banana]      1       1

UPDATE: starting with Pandas 0.20.1 we can create a SparseDataFrame directly from a sparse matrix:

In [13]: r = pd.SparseDataFrame(X, columns=vect.get_feature_names(), index=df.index, default_fill_value=0)

In [14]: r
Out[14]:
   apple  banana
0      1       0
1      1       1

In [15]: r.memory_usage()
Out[15]:
Index     80
apple     16   # 2 * 8 byte (np.int64)
banana     8   # 1 * 8 byte (as there is only one `1` value)
dtype: int64

In [16]: r.dtypes
Out[16]:
apple     int64
banana    int64
dtype: object
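
SparseDataFrame was removed in pandas 1.0, so the update above only runs on older versions. A rough equivalent for newer pandas, sketched here under the assumption that pandas >= 0.25 is installed, is the DataFrame.sparse.from_spmatrix constructor. The hasattr fallback for the feature names is an addition for this example, to cover both older and newer scikit-learn releases.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

# join list cells into one space-separated string so that each item becomes a token
docs = df['fruit'].map(lambda v: ' '.join(v) if isinstance(v, list) else v)

vect = CountVectorizer()
X = vect.fit_transform(docs)  # scipy sparse matrix

# get_feature_names_out() exists on scikit-learn >= 1.0, get_feature_names() on older ones
if hasattr(vect, 'get_feature_names_out'):
    names = list(vect.get_feature_names_out())
else:
    names = vect.get_feature_names()

# build a DataFrame with sparse columns without densifying X first
r = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=names)
print(r.dtypes)
```

The columns get a Sparse dtype, so r.memory_usage() should again reflect only the stored non-zero entries rather than the full dense size.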


2017-05-09 13:22:06 MaxU

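It does work, although I appear to run out of memory (32 GB); I guess I have a lot of columns. I also noticed that when I split the df into parts in order to do this, I got a lot of NaNs (even though I had removed all NaNs from my dataframe beforehand) – Kevin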

I realized the reason I was getting NaNs is that I had not set axis to 1 – Kevin
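
The effect described in that comment can presumably be reproduced as follows. This is a small sketch with made-up column blocks, not the commenter's actual data: pandas.concat stacks rows by default (axis=0), so blocks with different columns are padded with NaN, whereas axis=1 places them side by side.

```python
import pandas as pd

# two blocks of dummy columns for the same two rows (illustrative values)
a = pd.DataFrame({'apple': [1, 1]})
b = pd.DataFrame({'banana': [0, 1]})

stacked = pd.concat([a, b])               # default axis=0: 4 rows, missing cells become NaN
side_by_side = pd.concat([a, b], axis=1)  # axis=1: 2 rows, 2 columns, no NaN

print(stacked.isna().sum().sum())
print(side_by_side.isna().sum().sum())
```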

@Kevin, in Pandas 0.20.1 you can create a SparseDataFrame directly from a sparse matrix (the result of CountVectorizer). Please check my updated answer – MaxU
