2014-11-06 40 views
2

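Creating a group identifier for regression

I have a DataFrame with several identifier columns. I want to create a new "group identifier" for each unique combination of identifiers, so that I can later run a regression with statsmodels. That is, say I have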

id1 id2 id3 
    A 1 100 
    A 1 101 
    B 1 100 
    B 1 100 

I want

id1 id2 id3 groupid 
    A 1 100  0 
    A 1 101  1 
    B 1 100  2 
    B 1 100  2 

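with id1, id2 and id3 together acting as the group identifier. I know I can call unique() to get the unique groups, but how do I efficiently encode each row into the unique group it belongs to?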

Adapting @bernie's answer to handle potential NaNs:

import numpy as np
import pandas as pd

# get a DataFrame with just the unique "keys"
# (-1 stands in for NaN; this assumes -1 never occurs in the id columns)
df2 = df.replace(np.nan, -1)
g = df2.groupby(['id1', 'id2', 'id3'])
gdf = pd.DataFrame(list(g.groups.keys()), columns=df.columns)
gdf = gdf.replace(-1, np.nan)
# an idea is to re-use the index as the 'group_id'
# the next two commands support that
# (sort_values replaces DataFrame.sort, which newer pandas no longer has)
gdf = gdf.sort_values(['id1', 'id2', 'id3']).reset_index(drop=True)
gdf['group_id'] = gdf.index

# merge on the three id columns
mdf = df.merge(gdf, how='inner', on=df.columns.tolist())

Answers

1

There are surely countless solutions. Here is what I came up with:

>>> df 
    id1 id2 id3 
0 A 1 100 
1 A 1 101 
2 B 1 100 
3 B 1 100 

import pandas as pd

# get a DataFrame with just the unique "keys"
g = df.groupby(['id1', 'id2', 'id3'])
gdf = pd.DataFrame(list(g.groups.keys()), columns=df.columns)

# an idea is to re-use the index as the 'group_id'
# the next two commands support that
# (sort_values replaces DataFrame.sort, which newer pandas no longer has)
gdf = gdf.sort_values(['id1', 'id2', 'id3']).reset_index(drop=True)
gdf['group_id'] = gdf.index

# merge on the three id columns
mdf = df.merge(gdf, how='inner', on=df.columns.tolist())

Produces:

 
    id1 id2 id3 group_id 
0 A 1 100   0 
1 A 1 101   1 
2 B 1 100   2 
3 B 1 100   2 
+0

I like this answer because I can pass along a list of columns. I also hope it is more efficient. – FooBar 2014-11-07 00:35:53

+1

You should replace the first line with the following: 'df2 = df.replace(np.nan, -1)', 'g = df2.groupby(...)'. 'groupby' does not work well with 'NaN' ("as expected") and will create a separate group for each 'NaN' value. – FooBar 2014-11-07 19:39:32

+0

Thanks, @FooBar. I updated the answer. Would dropping the NaNs work for your purposes? – bernie 2014-11-07 20:01:37
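
On the efficiency and NaN points raised in these comments: newer pandas versions offer a shorter route to the same numbering. The following is only a sketch, not part of the original answer, and it assumes pandas 1.1 or later (needed for the `dropna=False` argument). `GroupBy.ngroup()` numbers the groups in sorted key order, and `dropna=False` asks groupby to keep rows whose keys contain NaN instead of dropping them. The frame `df_nan` is a made-up example for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id1': ['A', 'A', 'B', 'B'],
    'id2': [1, 1, 1, 1],
    'id3': [100, 101, 100, 100],
})
id_cols = ['id1', 'id2', 'id3']

# one integer per unique (id1, id2, id3) combination, in sorted key order
df['group_id'] = df.groupby(id_cols).ngroup()
print(df['group_id'].tolist())  # [0, 1, 2, 2]

# hypothetical frame with missing ids: rows 1 and 3 share the key ('A', 1, NaN)
df_nan = pd.DataFrame({
    'id1': ['A', 'A', 'B', 'A'],
    'id2': [1, 1, 1, 1],
    'id3': [100, np.nan, 100, np.nan],
})
# dropna=False keeps NaN keys as groups of their own (assumes pandas >= 1.1)
df_nan['group_id'] = df_nan.groupby(id_cols, dropna=False).ngroup()
```

This avoids both the sentinel value -1 and the merge step, at the cost of requiring a more recent pandas than the one this thread was written against.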

1

Is this what you are looking for?

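import pandas as pd

df = pd.DataFrame({'id1': ['A','A','B','B'], 'id2': [1,1,1,1], 'id3': [100,101,100,100]})

def makegroup(x, y, z):
    # concatenate the ids into one string key; without a separator,
    # different combinations can collide (e.g. ('A', 1, 10) and ('A', 11, 0))
    return str(x) + str(y) + str(z)

df['groupid'] = df.apply(lambda row: makegroup(row['id1'], row['id2'], row['id3']), axis=1)

# number the unique keys 1, 2, 3, ... in order of first appearance
groupiddict = {}
groupincrementer = 1

for x in df['groupid'].unique():
    groupiddict[x] = groupincrementer
    groupincrementer += 1

# look up each row's integer id from its string key
df['groupidINT'] = df['groupid'].map(groupiddict)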

Here is the output:

id1 id2 id3 groupid groupidINT 
0 A 1 100 A1100   1 
1 A 1 101 A1101   2 
2 B 1 100 B1100   3 
3 B 1 100 B1100   3
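
A variation on the same idea, offered as a sketch rather than as the answerer's method: keying on tuples instead of concatenated strings avoids ambiguous keys (for example, ('A', 1, 10) and ('A', 11, 0) both concatenate to 'A110'), and `pd.factorize` does the numbering in order of first appearance. The names `id_cols` and `keys` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id1': ['A', 'A', 'B', 'B'],
    'id2': [1, 1, 1, 1],
    'id3': [100, 101, 100, 100],
})
id_cols = ['id1', 'id2', 'id3']

# one tuple per row, e.g. ('A', 1, 100); tuples keep the ids separate
keys = pd.Series(list(df[id_cols].itertuples(index=False, name=None)),
                 index=df.index)

# factorize returns 0-based codes in order of first appearance;
# adding 1 matches the 1-based numbering used in this answer
codes, uniques = pd.factorize(keys)
df['groupidINT'] = codes + 1
print(df['groupidINT'].tolist())  # [1, 2, 3, 3]
```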