2017-05-09

I want to be able to take a list of dictionaries (records) where, for some columns, the cell value is itself a list of values. Here is an example:

[{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}] 

How can I take this input and apply feature hashing to it (my dataset has thousands of columns)? I am currently using one-hot encoding, but that seems to consume a lot of memory (more than I have on my system).

I tried passing my dataset in as above, and got an error:

x__ = h.transform(data) 

Traceback (most recent call last): 

    File "<ipython-input-14-db4adc5ec623>", line 1, in <module> 
    x__ = h.transform(data) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform 
    _hashing.transform(raw_X, self.n_features, self.dtype) 

    File "sklearn/feature_extraction/_hashing.pyx", line 52, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:2103) 

TypeError: a float is required 
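A likely cause of this `TypeError`: `FeatureHasher` with dict input accepts string or numeric values, but here the `'fruit'` value can be a list. A minimal sketch of one workaround (the `flatten_record` helper is hypothetical, not part of sklearn) is to expand each list into separate `key=item` string features before hashing:

```python
from sklearn.feature_extraction import FeatureHasher

def flatten_record(record):
    # Hypothetical helper: expand list values into "key=item" indicator
    # features, since FeatureHasher rejects list-valued dict entries.
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            for item in value:
                flat['%s=%s' % (key, item)] = 1
        else:
            flat[key] = value
    return flat

data = [{'fruit': 'apple', 'age': 27},
        {'fruit': ['apple', 'banana'], 'age': 32}]

h = FeatureHasher(n_features=16)
X = h.transform(flatten_record(d) for d in data)
print(X.shape)  # (2, 16)
```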

I also tried turning it into a dataframe and passing that to the hasher:

x__ = h.transform(x_y_dataframe) 

Traceback (most recent call last): 

    File "<ipython-input-15-109e7f8018f3>", line 1, in <module> 
    x__ = h.transform(x_y_dataframe) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform 
    _hashing.transform(raw_X, self.n_features, self.dtype) 

    File "sklearn/feature_extraction/_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1928) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 138, in <genexpr> 
    raw_X = (_iteritems(d) for d in raw_X) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems 
    return d.iteritems() if hasattr(d, "iteritems") else d.items() 

AttributeError: 'unicode' object has no attribute 'items' 
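This second error suggests the hasher was iterating over the DataFrame itself, which yields its column names (unicode strings) rather than row dicts. A minimal sketch, assuming list-free cells, that converts the rows back to dicts first:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame([{'fruit': 'apple', 'age': 27},
                   {'fruit': 'banana', 'age': 32}])

h = FeatureHasher(n_features=8)
# Iterating a DataFrame directly yields column-name strings, which is
# what triggered the AttributeError above; pass row dicts instead.
X = h.transform(df.to_dict(orient='records'))
print(X.shape)  # (2, 8)
```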

Any ideas how I could do this with pandas or sklearn? Or perhaps I could build the dummy variables a few thousand rows at a time?

Here is how I currently get my dummy variables with pandas:

import pandas

def one_hot_encode(categorical_labels):
    res = []
    for col in categorical_labels:
        # stringify the list cells, drop the brackets, and split into
        # indicator columns (no way to set a prefix with this approach)
        v = x[col].astype(str).str.strip('[]').str.get_dummies(', ')
        res.append(v)
        # periodically collapse the accumulated frames to limit peak memory
        if len(res) == 2:
            res = [pandas.concat(res, axis=1)]
    return pandas.concat(res, axis=1)
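As a small self-contained illustration of the `str.get_dummies` trick the helper relies on (the sample frame is hypothetical):

```python
import pandas

# Hypothetical sample: one column whose cells are lists of labels.
x = pandas.DataFrame({'fruit': [['apple', 'banana'], ['apple']]})

# Stringify each cell, strip the brackets, then split on ', ' into
# indicator columns. Note the quote characters survive in the names.
v = x['fruit'].astype(str).str.strip('[]').str.get_dummies(', ')
print(v.columns.tolist())  # ["'apple'", "'banana'"]
print(v.values.tolist())   # [[1, 1], [1, 0]]
```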

You could convert the lists to tuples, which are hashable. – IanS
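Following that comment, a minimal sketch of converting the unhashable list values to tuples (the dict comprehension here is illustrative, not from the question):

```python
data = [{'fruit': 'apple', 'age': 27},
        {'fruit': ['apple', 'banana'], 'age': 32}]

# Replace each list value with an equivalent (hashable) tuple.
cleaned = [{k: tuple(v) if isinstance(v, list) else v for k, v in d.items()}
           for d in data]
print(cleaned[1]['fruit'])  # ('apple', 'banana')
```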

Answer


Consider the following approach:

import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer 

lst = [{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}] 

df = pd.DataFrame(lst) 

vect = CountVectorizer() 

X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x)) 

r = pd.DataFrame(X.A, columns=vect.get_feature_names(), index=df.index) 

df.join(r) 

Result:

In [66]: r 
Out[66]: 
    apple banana 
0  1  0 
1  1  1 

In [67]: df.join(r) 
Out[67]: 
    age   fruit apple banana 
0 27   apple  1  0 
1 32 [apple, banana]  1  1 

UPDATE: starting with Pandas 0.20.1, we can create a SparseDataFrame directly from a sparse matrix:

In [13]: r = pd.SparseDataFrame(X, columns=vect.get_feature_names(), index=df.index, default_fill_value=0) 

In [14]: r 
Out[14]: 
    apple banana 
0  1  0 
1  1  1 

In [15]: r.memory_usage() 
Out[15]: 
Index  80 
apple  16 # 2 * 8 byte (np.int64) 
banana  8 # 1 * 8 byte (as there is only one `1` value) 
dtype: int64 

In [16]: r.dtypes 
Out[16]: 
apple  int64 
banana int64 
dtype: object 

It works, although I seem to run out of memory (32 GB); I guess there are a lot of columns. I also noticed that when I split my df to be able to do this, it gave me a lot of NaNs (even though I removed all NaNs from my dataframe beforehand) – Kevin


I realized the reason I was getting NaNs is that I hadn't set the axis to 1 – Kevin


@Kevin, in Pandas 0.20.1 you can create a SparseDataFrame directly from a sparse matrix (the result of CountVectorizer). Please check my updated answer – MaxU