Pandas: aggregating based on start/end dates

It's really more of a dis-aggregation, because I have a dataset structured like this:

id  type  first_year  last_year
A   t1    2009        2014
A   t1    2010        2015
B   t1    2007        2009
B   t2    2008        2011

but I need to aggregate it by id and year, where the first/last year ranges overlap.

The data is in a pandas DataFrame like this:

test_frame = pd.DataFrame([['A','t1',2009,2014],
                           ['A','t1',2010,2015],
                           ['B','t1',2007,2009],
                           ['B','t2',2008,2011]],
                          columns = ['id','type','first_year','last_year'])

I would like to get the data back in one of a couple of different shapes, for example:

id  year  count
A   2009  1
A   2010  2
A   2011  2
...
B   2007  1
B   2008  2
B   2009  1

or perhaps like this:

id  year  type  count
A   2009  t1    1
A   2010  t1    2
A   2011  t1    2
...
B   2007  t1    1
B   2008  t1    1
B   2008  t2    1
B   2009  t2    1
B   2010  t2    1

The code below basically works for the first form, but as you can imagine, looping with itertuples gets slow once there are a lot of groups. Is there a more pandas-idiomatic way to do this?

out_frame = pd.DataFrame(columns = ['id','type','year'])
for rows in test_frame.itertuples():
    # expand each row into one record per year (range() stops before last_year)
    for year in range(int(rows[3]), int(rows[4])):
        d2 = pd.DataFrame({'id': [rows[1]], 'year': [year]}, columns=['id','year'])
        out_frame = out_frame.append(d2)
output1 = out_frame.groupby(['id','year'])['year'].count()
output1
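
Even just collecting the expanded rows in a plain Python list and building the frame once, rather than appending inside the loop, already helps; a rough sketch of that (same exclusive range() as above), though I'm hoping for something more vectorised:

rows_list = []
for rows in test_frame.itertuples():
    for year in range(int(rows[3]), int(rows[4])):
        rows_list.append({'id': rows[1], 'year': year})
# build the frame in one call instead of many appends
out_frame = pd.DataFrame(rows_list, columns=['id', 'year'])
output1 = out_frame.groupby(['id', 'year'])['year'].count()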

Answers

You can use stack and resample:

import pandas as pd 

test_frame = pd.DataFrame([['A','t1',2009,2014],
                           ['A','t1',2010,2015],
                           ['B','t1',2007,2009],
                           ['B','t2',2008,2011]],
                          columns = ['id','type','first_year','last_year'])

print test_frame 
  id type  first_year  last_year
0  A   t1        2009       2014
1  A   t1        2010       2015
2  B   t1        2007       2009
3  B   t2        2008       2011

#stack df, drop and rename column year 
test_frame = test_frame.set_index(['id','type'], append=True).stack().reset_index(level=[1,2,3]) 
test_frame = test_frame.drop('level_3', axis=1).rename(columns={0:'year'}) 
#convert year to datetime 
test_frame['year'] = pd.to_datetime(test_frame['year'], format="%Y") 
print test_frame 
  id type       year
0  A   t1 2009-01-01
0  A   t1 2014-01-01
1  A   t1 2010-01-01
1  A   t1 2015-01-01
2  B   t1 2007-01-01
2  B   t1 2009-01-01
3  B   t2 2008-01-01
3  B   t2 2011-01-01
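
As an aside, the same reshape could be written with melt instead of stack, starting again from the original test_frame built at the top of this answer; a rough sketch, meant as a drop-in for the two lines after the #stack df comment (the reset_index/set_index pair keeps the original row numbers that the groupby below relies on):

#reshape first_year/last_year into a single 'year' column via melt
test_frame = pd.melt(test_frame.reset_index(),
                     id_vars=['index', 'id', 'type'],
                     value_vars=['first_year', 'last_year'],
                     value_name='year')
test_frame = test_frame.drop('variable', axis=1).set_index('index')
test_frame['year'] = pd.to_datetime(test_frame['year'], format="%Y")
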
#resample and fill missing data 
out_frame = test_frame.groupby(test_frame.index).apply(lambda x: x.set_index('year').resample('1AS', how='first',fill_method='ffill')).reset_index(level=1) 
print out_frame 
        year id type
0 2009-01-01  A   t1
0 2010-01-01  A   t1
0 2011-01-01  A   t1
0 2012-01-01  A   t1
0 2013-01-01  A   t1
0 2014-01-01  A   t1
1 2010-01-01  A   t1
1 2011-01-01  A   t1
1 2012-01-01  A   t1
1 2013-01-01  A   t1
1 2014-01-01  A   t1
1 2015-01-01  A   t1
2 2007-01-01  B   t1
2 2008-01-01  B   t1
2 2009-01-01  B   t1
3 2008-01-01  B   t2
3 2009-01-01  B   t2
3 2010-01-01  B   t2
3 2011-01-01  B   t2
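
Note that the how= and fill_method= keywords in the resample call above come from the older pandas API; with a more recent pandas the same step would presumably be written with method calls, roughly as below (and 'AS' may need to be spelled 'YS' on the newest releases):

#same resample-and-ffill step, sketched with the newer resample API
out_frame = (test_frame.groupby(test_frame.index)
             .apply(lambda x: x.set_index('year').resample('AS').first().ffill())
             .reset_index(level=1))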

#convert to year 
out_frame['year'] = out_frame['year'].dt.year 
output1 = out_frame.groupby(['id','year', 'type'])['year'].count().reset_index(name='count') 
print output1 
    id  year type  count
0    A  2009   t1      1
1    A  2010   t1      2
2    A  2011   t1      2
3    A  2012   t1      2
4    A  2013   t1      2
5    A  2014   t1      2
6    A  2015   t1      1
7    B  2007   t1      1
8    B  2008   t1      1
9    B  2008   t2      1
10   B  2009   t1      1
11   B  2009   t2      1
12   B  2010   t2      1
13   B  2011   t2      1
output2 = out_frame.groupby(['id','year'])['year'].count().reset_index(name='count') 
print output2 
    id  year  count
0    A  2009      1
1    A  2010      2
2    A  2011      2
3    A  2012      2
4    A  2013      2
5    A  2014      2
6    A  2015      1
7    B  2007      1
8    B  2008      2
9    B  2009      2
10   B  2010      1
11   B  2011      1

Nice solution! – MaxU

My answer is based on expanding the first_year/last_year columns into a PeriodIndex of the covered years, then stacking (pivoting) that period range so that it creates a multi-level index.

def expand_period(row): 
    "Creates a series from first to last year and appends it to the row" 
    p = pd.period_range(row["first_year"], row["last_year"], freq="A") 
    return row.append(p.to_series()).drop(["first_year","last_year"]) 

#original data frame 
tf = pd.DataFrame([['A','t1',2009,2014],
                   ['A','t1',2010,2015],
                   ['B','t1',2007,2009],
                   ['B','t2',2008,2011]],
                  columns = ['id','type','first_year','last_year'])

#Drop the year columns but replace them with expanded series 
tfexpanded = tf.apply(expand_period, 1).set_index(["id","type"]) 
#Rotate the axis so that you have a 3 level index 
tfindexed = tfexpanded.stack() 
#This is not necessary but improves readability when watching the output 
tfindexed[:] = 1 
#Group-By as you did before 
answer = tfindexed.groupby(level=[0,1,2]).count() 

Of course, the whole thing could be condensed with lambdas and method chaining, as sketched below.
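
For example, a rough condensation of exactly the steps above, reusing expand_period (the answer variable name is only illustrative):

answer = (tf.apply(expand_period, axis=1)
            .set_index(['id', 'type'])
            .stack()
            .groupby(level=[0, 1, 2])
            .count())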
