2016-03-01

I have a DataFrame that I want to aggregate. It looks like this:

respondent_id,group_number,member_id 
1,1,3 
1,1,4 
1,2,1 
.... 

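My goal is to output two counts for each respondent_id: the number of groups that include the respondent as a member_id, and the number of groups that do not.

For example, the table above would output: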

respondent_id,my_groups,other_groups 
1,1,1 

My best guess is to do something like this:

rg_g = df.groupby(['respondent_id','group_number']) 
rg_g.apply(lambda g: g.respondent_id in g.id.values) 

But I don't know where to go from there.

Answers

1

Updated answer (this is not the best code, but it works):

Initialization:

import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.random.randint(5, size=(10, 3)),
                         columns=['respondent_id', 'group_number', 'member_id'])
# an integer column cannot hold missing values, so cast to float first
test_data['member_id'] = test_data['member_id'].astype(float)
# blank out member_id in some rows; row 10 does not exist in a 10-row frame, so it is left out
test_data.loc[[3, 5, 7, 8, 9], 'member_id'] = np.nan

Code:

# find the groups where the respondent has a member_id
d_nn = test_data[test_data.member_id.notnull()]
# or, for example: test_data[test_data.member_id != 0]
d_is_n = test_data[test_data.member_id.isnull()]
# count rows per (respondent_id, group_number) on each side
d_nn = pd.DataFrame({'count': d_nn.groupby(["respondent_id", "group_number"]).size()}).reset_index()
d_is_n = pd.DataFrame({'count': d_is_n.groupby(["respondent_id", "group_number"]).size()}).reset_index()
d_nn['is_member'] = 1
d_is_n['is_member'] = 0


# merge: keep a null-member row only if its (respondent_id, group_number)
# pair does not already appear among the non-null rows
result = d_nn.copy()
for idx1 in range(len(d_is_n)):
    merge = True
    for idx2 in range(len(d_nn)):
        if d_nn.iloc[idx2].respondent_id == d_is_n.iloc[idx1].respondent_id and \
           d_nn.iloc[idx2].group_number == d_is_n.iloc[idx1].group_number:
            merge = False
    if merge:
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        temp_d = d_is_n.iloc[[idx1]]
        result = pd.concat([result, temp_d], ignore_index=True)

# group by respondent_id and is_member
result = pd.DataFrame({'group_number': result.groupby(["respondent_id", "is_member"]).size()}).reset_index()
print(result)
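
The nested loop in the merge step is effectively an anti-join: it keeps the rows of d_is_n whose (respondent_id, group_number) pair is absent from d_nn. A minimal sketch of the same step with `DataFrame.merge` and `indicator=True` follows. The two small frames are hypothetical stand-ins for d_nn and d_is_n, not data from the question, and the names `keys`, `flagged`, `only_null` and `counts` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for d_nn and d_is_n (made-up values).
d_nn = pd.DataFrame({
    "respondent_id": [1, 1, 2],
    "group_number": [1, 2, 1],
    "count": [2, 1, 1],
    "is_member": 1,
})
d_is_n = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "group_number": [2, 2, 1],
    "count": [1, 3, 1],
    "is_member": 0,
})

keys = ["respondent_id", "group_number"]

# Left-join the null-member rows against the key pairs of the non-null rows.
# indicator=True adds a "_merge" column recording where each row was found.
flagged = d_is_n.merge(d_nn[keys], on=keys, how="left", indicator=True)

# Keep only rows whose key pair does not occur in d_nn.
only_null = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

result = pd.concat([d_nn, only_null], ignore_index=True)
counts = (result.groupby(["respondent_id", "is_member"])
                .size()
                .reset_index(name="group_number"))
print(counts)
```

Because the key pairs in d_nn come from a groupby and are therefore unique, the left join should not duplicate any row of d_is_n.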
+0

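This is really close to what I'm looking for. However, I need to filter the list so that I count the groups that contain the respondent_id and the groups that don't. – Jeremy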

1

So, here is what I finally implemented. Maybe not ideal, but it seems to work. :)

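import pandas as pd

rg = pd.read_csv('./in_file.csv')
rg_g = rg.groupby(['respondent_id', 'group_number'])
# respondent_id is constant within a group, so test its first value against
# the member column (named member_id in the sample data)
in_g = rg_g.filter(lambda g: g.respondent_id.iloc[0] in g.member_id.values)
out_g = rg_g.filter(lambda g: g.respondent_id.iloc[0] not in g.member_id.values)
my_count = in_g.groupby('respondent_id').group_number.nunique().rename('my_groups')
other_count = out_g.groupby('respondent_id').group_number.nunique().rename('other_groups')
# a respondent missing from one side gets 0 instead of NaN
pd.concat([my_count, other_count], axis=1).fillna(0).astype(int).to_csv('./out_file.csv')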
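
For comparison, the same two counts can probably be computed without a per-group lambda by flagging self-membership row by row and then aggregating. This is only a sketch: it assumes the member column is called member_id as in the sample table, builds the three sample rows inline instead of reading a CSV, and the names `is_self`, `contains_self` and `summary` are made up for the example.

```python
import pandas as pd

# The three sample rows shown in the question.
df = pd.DataFrame({
    "respondent_id": [1, 1, 1],
    "group_number": [1, 1, 2],
    "member_id": [3, 4, 1],
})

# True on rows where the respondent appears as a member of the group.
df["is_self"] = df["member_id"] == df["respondent_id"]

# One boolean per (respondent_id, group_number): does the group contain the respondent?
contains_self = df.groupby(["respondent_id", "group_number"])["is_self"].any()

# Count the groups on each side of that flag, per respondent.
summary = pd.DataFrame({
    "my_groups": contains_self.groupby(level="respondent_id").sum(),
    "other_groups": (~contains_self).groupby(level="respondent_id").sum(),
}).astype(int).reset_index()
print(summary)
```

On the sample rows this gives one group that contains respondent 1 (group 2) and one that does not (group 1), matching the expected output in the question.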