2013-07-19 134 views
4

Python pandas: counting the number of times each unique value appears in multiple columns

Suppose I have a DataFrame such as:

In [7]: source = pd.DataFrame([['amazon.com', 'correct', 'correct'], ['amazon.com', 'incorrect', 'correct'], ['walmart.com', 'incorrect', 'correct'], ['walmart.com', 'incorrect', 'incorrect']], columns=['domain', 'price', 'product']) 

In [8]: source 
Out[8]: 
        domain      price    product
0   amazon.com    correct    correct
1   amazon.com  incorrect    correct
2  walmart.com  incorrect    correct
3  walmart.com  incorrect  incorrect

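For each domain, I would like to count the number of rows where price == 'correct' and where price == 'incorrect', and do the same for product. In other words, I'd like to see output like this: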

        domain      key      value  count
0   amazon.com    price    correct      1
1   amazon.com    price  incorrect      1
2   amazon.com  product    correct      2
3  walmart.com    price  incorrect      2
4  walmart.com  product    correct      1
5  walmart.com  product  incorrect      1

How can I do this?

Answers

7

A nested apply will do it:

In [24]: source.groupby('domain').apply(lambda x: 
          x[['price','product']].apply(lambda y: y.value_counts())).fillna(0) 

Out[24]: 
                       price  product
domain
amazon.com  correct        1        2
            incorrect      1        0
walmart.com correct        0        1
            incorrect      2        1

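This is a clear solution. `x` is a `DataFrame` containing all rows that share the same domain. The inner apply takes the `price` and `product` columns of `x` as `Series` objects, one per column, and counts how many times each distinct value appears in `y`. – duckworthd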
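To make the roles of `x` and `y` in that comment concrete, here is a minimal sketch that performs the inner step by hand for a single domain. Selecting the `amazon.com` rows with a boolean mask is only a stand-in (an assumption for illustration) for what `groupby('domain')` passes to the outer lambda, and the `astype(int)` at the end is an addition to keep the counts as integers.

```python
import pandas as pd

source = pd.DataFrame(
    [['amazon.com', 'correct', 'correct'],
     ['amazon.com', 'incorrect', 'correct'],
     ['walmart.com', 'incorrect', 'correct'],
     ['walmart.com', 'incorrect', 'incorrect']],
    columns=['domain', 'price', 'product'])

# x: the sub-DataFrame the outer lambda would receive for 'amazon.com'
x = source[source['domain'] == 'amazon.com']

# y is each selected column in turn; value_counts() tallies its distinct values
inner = x[['price', 'product']].apply(lambda y: y.value_counts())

# 'incorrect' never occurs in product for amazon.com, so that cell is NaN
# until it is filled with 0
inner = inner.fillna(0).astype(int)
print(inner)
```

Repeating this for every domain and stacking the results is what the outer `groupby(...).apply(...)` does, which is why the final output has a two-level index of domain and value.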

0
In [17]: %paste 
    (
     pd.melt(source, id_vars=['domain'], value_vars=['price', 'product']) 
     .groupby(['domain', 'variable', 'value']) 
     .size() 
     .reset_index() 
     .rename(columns={'variable': 'key', 0: 'count'}) 
    ) 

## -- End pasted text -- 
Out[17]: 
        domain      key      value  count
0   amazon.com    price    correct      1
1   amazon.com    price  incorrect      1
2   amazon.com  product    correct      2
3  walmart.com    price  incorrect      2
4  walmart.com  product    correct      1
5  walmart.com  product  incorrect      1

I wouldn't write `output =`; I'd just print the output itself. – Jeff
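The melt-based answer can be easier to follow when the chain is split into named steps. The sketch below does that; the intermediate names `long_df`, `counts`, and `result` are made up for illustration, and `reset_index(name='count')` is used as a shorthand for renaming the `0` column afterwards.

```python
import pandas as pd

source = pd.DataFrame(
    [['amazon.com', 'correct', 'correct'],
     ['amazon.com', 'incorrect', 'correct'],
     ['walmart.com', 'incorrect', 'correct'],
     ['walmart.com', 'incorrect', 'incorrect']],
    columns=['domain', 'price', 'product'])

# Step 1: reshape to long format, one row per (domain, column name, cell value).
# melt() names the new columns 'variable' and 'value' by default.
long_df = pd.melt(source, id_vars=['domain'], value_vars=['price', 'product'])

# Step 2: count the rows in each (domain, variable, value) combination.
counts = long_df.groupby(['domain', 'variable', 'value']).size()

# Step 3: turn the MultiIndex back into columns and apply the requested names.
result = counts.reset_index(name='count').rename(columns={'variable': 'key'})
print(result)
```

Only combinations that actually occur get a row, so `amazon.com` / `product` / `incorrect` is absent here, whereas the nested-apply answer reports it as 0.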