2016-01-11 66 views
5

(Or... am I just editing a list of lists?) Pivoting an irregular dictionary of lists into a pandas DataFrame

Is there an existing Python/pandas method that converts a structure like this

food2 = {}
food2["apple"] = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"] = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

into a pivot table like this?

+---------+-------+-----+-------+------+--------+--------+-----+ 
|   | fruit | veg | round | long | yellow | orange | red | 
+---------+-------+-----+-------+------+--------+--------+-----+ 
| apple | 1  |  | 1  |  |  |  |  | 
+---------+-------+-----+-------+------+--------+--------+-----+ 
| bananna | 1  |  |  | 1 | 1  |  |  | 
+---------+-------+-----+-------+------+--------+--------+-----+ 
| carrot |  | 1 |  | 1 |  | 1  |  | 
+---------+-------+-----+-------+------+--------+--------+-----+ 
| raddish |  | 1 |  |  |  |  | 1 | 
+---------+-------+-----+-------+------+--------+--------+-----+ 

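Naively, I would probably just loop through the dictionary. I can see how to use map on each inner list, but I don't know how to join/stack the results across the dictionary. Once I have joined them, I could just use pandas.pivot_table.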

import pandas as pd

for key in food2:
    attrlist = food2[key]
    # pair each attribute with the name of its food
    onefruit_pairs = [[key, x] for x in attrlist]
    one_fruit_frame = pd.DataFrame(onefruit_pairs, columns=['fruit', 'attr'])
    print(one_fruit_frame)

    fruit attr 
0 bananna fruit 
1 bananna yellow 
2 bananna long 
    fruit attr 
0 carrot  veg 
1 carrot orange 
2 carrot long 
    fruit attr 
0 apple fruit 
1 apple round 
    fruit attr 
0 raddish veg 
1 raddish red 
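
The join/stack step described above can be sketched with pd.concat followed by pd.crosstab. This is only a minimal sketch: the column name `food` and the variable names are illustrative assumptions, not part of the original code.

```python
import pandas as pd

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# One small (food, attr) frame per dictionary key, as in the loop above.
# The scalar key is broadcast against the list of attributes.
frames = [
    pd.DataFrame({"food": key, "attr": attrlist})
    for key, attrlist in food2.items()
]

# Stack the per-key frames into a single long-format frame.
stacked = pd.concat(frames, ignore_index=True)

# Count each (food, attr) pair; pairs that never occur become 0.
table = pd.crosstab(stacked["food"], stacked["attr"])
print(table)
```

pd.crosstab is used here instead of pandas.pivot_table only because the long frame has no value column to aggregate.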

Answers

2

Pure Python:

from itertools import chain 

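def count(d):
    # every distinct attribute becomes a column (set order is arbitrary)
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        # True where the item has the attribute, False otherwise
        yield [row] + [(col in values) for col in cols]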

Test:

>>> food2 = {   
    "apple": ["fruit", "round"], 
    "bananna": ["fruit", "yellow", "long"], 
    "carrot": ["veg", "orange", "long"], 
    "raddish": ["veg", "red"] 
} 

>>> list(count(food2)) 
[['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'], 
['bananna', True, False, True, True, False, False, False], 
['carrot', True, True, False, False, True, False, False], 
['apple', False, False, True, False, False, True, False], 
['raddish', False, True, False, False, False, False, True]] 
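
If a DataFrame is wanted at the end, the rows produced by count can be loaded into pandas. A minimal sketch, assuming that no attribute is itself called 'name' and that 0/1 integers are preferred over booleans; the sorting step is an addition to make the column order reproducible.

```python
from itertools import chain

import pandas as pd


def count(d):
    # same generator as above: header row first, then one row per key
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]


food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

rows = list(count(food2))
header, body = rows[0], rows[1:]

df = (
    pd.DataFrame(body, columns=header)
    .set_index('name')
    .astype(int)          # True/False -> 1/0
    .sort_index(axis=1)   # set order is arbitrary, so sort the columns
)
print(df)
```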

[Update]

Performance test:

>>> from itertools import product 
>>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7))) 
>>> attrs = labels[:1000] 
>>> import random 
>>> sample = {} 
>>> for k in labels: 
...  sample[k] = random.sample(attrs, 5) 
>>> import time 
>>> n = time.time(); list(count(sample)); print(time.time() - n)
62.0367980003 

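It took less than 2 minutes for 279936 rows by 1000 columns on my busy machine (lots of Chrome tabs open). Let me know if the performance is not acceptable.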

[Update]

Testing the performance of the other answer:

>>> n = time.time(); \
...  df = pd.DataFrame(dict([(k, pd.Series(v)) for k,v in sample.items()])); \
...  print(time.time() - n)
72.0512290001 

The next line (df = pd.melt(...)) was taking too long, so I cancelled the test. Take this result with a grain of salt, since it was run on a busy machine.

+0

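Excellent. For thousands of "fruits" and thousands of attributes, do you have any intuition about how this would perform (compared with some as-yet-unspecified pandas magic)? –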

+0

I *had* to import itertools –

+1

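This solution is optimized for simplicity rather than performance. There is a lot of room for improvement, especially if you know the attributes in advance. Updated to add the missing import. –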
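
As a rough illustration of the "attributes known in advance" idea from the comment above, here is a pure-Python sketch. The function name `count_known` and its `attrs` parameter are hypothetical, and it has not been benchmarked against the original.

```python
def count_known(d, attrs):
    """Yield a header row, then one 0/1 row per key of d.

    attrs is assumed to be the complete list of attributes, known up front.
    """
    # Map each attribute to its column position once.
    position = {attr: i for i, attr in enumerate(attrs)}
    yield ["name"] + list(attrs)
    for name, values in d.items():
        row = [0] * len(attrs)
        for value in values:
            # Raises KeyError if an attribute is missing from attrs.
            row[position[value]] = 1
        yield [name] + row


food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}
attrs = ["fruit", "long", "orange", "red", "round", "veg", "yellow"]

rows = list(count_known(food2, attrs))
for row in rows:
    print(row)
```

Each row here only touches the columns of the attributes an item actually has, instead of testing every column against the item's list, which should matter most when the data is very sparse.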

1

An answer using pandas.

import pandas as pd

# Test data
food2 = {}
food2["apple"] = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"] = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

# one column per item; shorter lists are padded with NaN
df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in food2.items()]))
# melting to long format
df = pd.melt(df, var_name='item', value_name='categ')
# cross-tabulation
df = pd.crosstab(df['item'], df['categ'])
# sorting index, maybe not necessary
df.sort_index(inplace=True)
df

categ fruit long orange red round veg yellow 
item             
apple  1  0  0 0  1 0  0 
bananna  1  1  0 0  0 0  1 
carrot  0  1  1 0  0 1  0 
raddish  0  0  0 1  0 1  0 
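
A possible variant, not part of the answer above, goes straight to the long format and skips the wide intermediate frame that has one column per item. It is only a sketch and assumes pandas 0.25 or later, where Series.explode is available.

```python
import pandas as pd

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# A Series of lists indexed by item, exploded to one row per (item, categ).
long_form = (
    pd.Series(food2)
    .rename_axis("item")
    .explode()
    .rename("categ")
    .reset_index()
)

# Same cross-tabulation step as in the answer.
table = pd.crosstab(long_form["item"], long_form["categ"])
print(table)
```
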
+0

I like this one too. You have one typo: cater vs. categ –

+0

Thanks, I just fixed the typo. – Romain

+0

Tested with the same input as the other answer. Oddly, for that input the performance is not that far off (279936 rows × 1000 columns, very sparse). –