2016-03-20 148 views

Pandas: transposing one column in a multi-column DataFrame

This is the data I have:

  cmte_id trans entity st amount fec_id 
date       
2007-08-15 C00112250 24K  ORG  DC 2000 C00431569 
2007-09-26 C00119040 24K  CCM  FL 1000 C00367680 
2007-09-26 C00119040 24K  CCM  MD 1000 C00140715 
2007-07-20 C00346296 24K  CCM  CA 1000 C00434571 
2007-09-24 C00346296 24K  CCM  MA 1000 C00433136 

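There are other descriptive columns that I have left out for brevity. I would like to transform the data so that the values in [cmte_id] become the column headers and the values in [amount] become the corresponding values in the new columns. I know this is probably a simple pivot operation. I have tried the following: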

dfy.pivot('cmte_id', 'amount') 
--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-203-e5d2cb89e880> in <module>() 
----> 1 dfy.pivot('cmte_id', 'amount') 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in pivot(self, index, columns, values) 
    3761   """ 
    3762   from pandas.core.reshape import pivot 
-> 3763   return pivot(self, index=index, columns=columns, values=values) 
    3764 
    3765  def stack(self, level=-1, dropna=True): 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in pivot(self, index, columns, values) 
    323   append = index is None 
    324   indexed = self.set_index(cols, append=append) 
--> 325   return indexed.unstack(columns) 
    326  else: 
    327   if index is None: 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in unstack(self, level) 
    3857   """ 
    3858   from pandas.core.reshape import unstack 
-> 3859   return unstack(self, level) 
    3860 
    3861  #---------------------------------------------------------------------- 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level) 
    402  if isinstance(obj, DataFrame): 
    403   if isinstance(obj.index, MultiIndex): 
--> 404    return _unstack_frame(obj, level) 
    405   else: 
    406    return obj.T.stack(dropna=False) 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _unstack_frame(obj, level) 
    442  else: 
    443   unstacker = _Unstacker(obj.values, obj.index, level=level, 
--> 444        value_columns=obj.columns) 
    445   return unstacker.get_result() 
    446 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns) 
    96 
    97   self._make_sorted_values_labels() 
---> 98   self._make_selectors() 
    99 
    100  def _make_sorted_values_labels(self): 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _make_selectors(self) 
    134 
    135   if mask.sum() < len(self.index): 
--> 136    raise ValueError('Index contains duplicate entries, ' 
    137        'cannot reshape') 
    138 

ValueError: Index contains duplicate entries, cannot reshape 

The desired end result (apart from the additional columns such as 'trans', 'fec_id', 'st', etc.) would look something like this:

date C00112250 C00119040 C00119040 C00346296 C00346296 
2007-08-15 2000     
2007-09-26    1000    
2007-09-26       1000  
2007-07-20          1000 
2007-09-24             1000 

Does anyone know how I could get close to this end product?

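Please check your sample input data and the expected result set: it is definitely not right. If you want to turn 'cmte_id' into columns, then the column names in the expected output should contain the 'cmte_id' values from the input data frame, which is not the case. Besides that, there is no 'id.thomas' column in your input, so how does it appear in the output? – MaxU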

Thanks MaxU, I just edited it. I was getting ahead of myself. –

Answers

Try this:

pvt = pd.pivot_table(df, index=df.index, columns='cmte_id', 
        values='amount', aggfunc='sum', fill_value=0) 
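As a rough, self-contained sketch of why this helps, the snippet below rebuilds the five sample rows from the question and compares plain `pivot` with `pivot_table`. It is an illustration under assumptions, not the asker's actual code: the frame is typed in by hand, the index is given the name `date`, and the pivot is done on a reset index (which should be equivalent to passing `index=df.index`).

```python
import pandas as pd

# The five sample rows from the question, indexed by date (typed in by hand).
df = pd.DataFrame(
    {
        "cmte_id": ["C00112250", "C00119040", "C00119040", "C00346296", "C00346296"],
        "trans": ["24K"] * 5,
        "entity": ["ORG", "CCM", "CCM", "CCM", "CCM"],
        "st": ["DC", "FL", "MD", "CA", "MA"],
        "amount": [2000, 1000, 1000, 1000, 1000],
        "fec_id": ["C00431569", "C00367680", "C00140715", "C00434571", "C00433136"],
    },
    index=pd.to_datetime(
        ["2007-08-15", "2007-09-26", "2007-09-26", "2007-07-20", "2007-09-24"]
    ).rename("date"),
)

# Plain pivot needs each (index, column) pair to be unique.
# Here 2007-09-26 / C00119040 occurs twice, so it refuses to reshape.
try:
    df.pivot(columns="cmte_id", values="amount")
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print("pivot failed:", exc)

# pivot_table aggregates the duplicates (here with a sum) instead of failing.
pvt = pd.pivot_table(
    df.reset_index(),
    index="date",
    columns="cmte_id",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)
print(pvt)
```

Note that with only the date as the row key, the two 2007-09-26 rows for C00119040 are summed into a single cell, so the result has four rows rather than five.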

To keep the other columns:

In [213]: pvt = pd.pivot_table(df.reset_index(), index=['index','trans','entity','st', 'fec_id'], 
    .....:      columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \ 
    .....:   .reset_index() 

In [214]: pvt 
Out[214]: 
cmte_id  index trans entity st  fec_id C00112250 C00119040 \ 
0  2007-07-20 24K CCM CA C00434571   0   0 
1  2007-08-15 24K ORG DC C00431569  2000   0 
2  2007-09-24 24K CCM MA C00433136   0   0 
3  2007-09-26 24K CCM FL C00367680   0  1000 
4  2007-09-26 24K CCM MD C00140715   0  1000 

cmte_id C00346296 
0    1000 
1    0 
2    1000 
3    0 
4    0 

In [215]: pvt.head()['st'] 
Out[215]: 
0 CA 
1 DC 
2 MA 
3 FL 
4 MD 
Name: st, dtype: object 
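The role of the trailing `.reset_index()` can be checked with a small sketch on the same sample rows. The frame below is rebuilt by hand, and the date column is called `date` here rather than `index`, which is an assumption about how the index is named. Before the reset, the grouping keys live in the row MultiIndex, so selecting one of them as a column raises a `KeyError`; after the reset they are ordinary columns.

```python
import pandas as pd

# Sample rows from the question, with the date as a regular column.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2007-08-15", "2007-09-26", "2007-09-26", "2007-07-20", "2007-09-24"]
        ),
        "cmte_id": ["C00112250", "C00119040", "C00119040", "C00346296", "C00346296"],
        "trans": ["24K"] * 5,
        "entity": ["ORG", "CCM", "CCM", "CCM", "CCM"],
        "st": ["DC", "FL", "MD", "CA", "MA"],
        "amount": [2000, 1000, 1000, 1000, 1000],
        "fec_id": ["C00431569", "C00367680", "C00140715", "C00434571", "C00433136"],
    }
)

keys = ["date", "trans", "entity", "st", "fec_id"]
wide = pd.pivot_table(
    df, index=keys, columns="cmte_id", values="amount", aggfunc="sum", fill_value=0
)

# Without reset_index the keys are index levels, not columns.
try:
    wide[["st"]]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# After reset_index every key is a normal column again.
flat = wide.reset_index()
print(flat.columns.tolist())
print(flat["st"].tolist())
```

If a column lookup fails after pivoting, comparing `wide.index.names` with `flat.columns` is a quick way to see whether the name is an index level, a column, or simply spelled differently.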

UPDATE:

import pandas as pd 
import glob 


# if you don't use ['cand_id'] column - remove it from `usecols` parameter 
dfy = pd.concat([pd.read_csv(f, sep='|', low_memory=False, header=None, 
          names=['cmte_id', '2', '3', '4','5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'], 
          usecols= ['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'amount', 'fec_id', 'cand_id'], 
          dtype={'date': str}) 
       for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt') 
       ], 
       ignore_index=True) 

dfy['date'] = pd.to_datetime(dfy['date'], format='%m%d%Y') 

# remove not needed column ASAP in order to save memory 
del dfy['cand_id'] 

dfy = dfy[(dfy['date'].notnull()) & (dfy['date'] > '2007-01-01') & (dfy['date'] < '2014-12-31') ] 

#df = dfy.set_index(['date']) 

pvt = pd.pivot_table(dfy, index=['date','trans_typ','entity_typ','state','fec_id'], 
        columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \ 
     .reset_index() 


print(pvt.info()) 

pvt.to_excel('out.xlsx', index=False) 
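The loading and filtering steps above can be tried without the FEC files. The sketch below is only an illustration: the four pipe-delimited rows are made up, and the seven-column layout is a simplified, hypothetical stand-in for the real 22-column file. It shows why `dtype={'date': str}` matters (a value such as `08152007` keeps its leading zero) and how rows with a missing or out-of-range date are dropped.

```python
import io

import pandas as pd

# Made-up rows in a simplified, hypothetical layout (not the real FEC format).
raw = io.StringIO(
    "C00112250|24K|ORG|DC|08152007|2000|C00431569\n"
    "C00119040|24K|CCM|FL|09262007|1000|C00367680\n"
    "C00119040|24K|CCM|MD|12312006|1000|C00140715\n"  # before 2007: filtered out
    "C00346296|24K|CCM|CA||1000|C00434571\n"          # missing date: filtered out
)
names = ["cmte_id", "trans_typ", "entity_typ", "state", "date", "amount", "fec_id"]

# Read the date as text so that the leading zero of the month survives.
dfy = pd.read_csv(raw, sep="|", header=None, names=names, dtype={"date": str})

# MMDDYYYY strings become timestamps; missing values become NaT.
dfy["date"] = pd.to_datetime(dfy["date"], format="%m%d%Y")

# Keep only rows with a valid date inside the window of interest.
dfy = dfy[
    dfy["date"].notnull()
    & (dfy["date"] > "2007-01-01")
    & (dfy["date"] < "2014-12-31")
]
print(dfy)
```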

It works. Thanks again! I am often humbled and relieved by my experiences on this site! –

Always glad to help! :) – MaxU

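I think there may be a small problem. I ran `pvt.head()[['state']]` to check that the other columns are still there, and I got an error: `KeyError: "['state'] not in index"`. Is this how it is supposed to be? If so, how can I keep my other columns, or can I keep them at all while still performing this operation? –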