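Creating group IDs for a regression

I have a DataFrame with several identifier columns. I want to create a new "group identifier" for each unique combination of identifiers; later, I want to use statsmodels to run regressions. That is, say I have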

id1 id2 id3 
    A 1 100 
    A 1 101 
    B 1 100 
    B 1 100

I want

id1 id2 id3 groupid 
    A 1 100  0 
    A 1 101  1 
    B 1 100  2 
    B 1 100  2

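with id1, id2, id3 as the group identifiers. I know I can call unique() to get the unique groups, but how do I efficiently encode each row into the unique group it belongs to?

Adapting @bernie's answer to handle potential NaNs: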

import numpy as np
import pandas as pd

# get a DataFrame with just the unique "keys"
# NaN is swapped for a sentinel first; this assumes -1 never occurs as a real id
df2 = df.replace(np.nan, -1)
g = df2.groupby([u'id1',u'id2',u'id3'])
gdf = pd.DataFrame(list(g.groups.keys()),columns=df.columns)
gdf = gdf.replace(-1, np.nan)
# an idea is to re-use the index as the 'group_id'
# the next three commands support that
# (DataFrame.sort was removed from pandas; sort_values is its replacement)
gdf.sort_values([u'id1',u'id2',u'id3'],inplace=True)
gdf.reset_index(drop=True,inplace=True)
gdf['group_id'] = gdf.index

# merge on the three id columns
mdf = df.merge(gdf,how='inner',on=df.columns.tolist())
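
For readers on a recent pandas, the sentinel round-trip above may not be needed. The following is only a minimal sketch, assuming pandas 1.1 or later (where `groupby` accepts `dropna=False`); the sample frame with NaN in `id3` is made up for illustration and is not from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): id3 contains missing values.
df = pd.DataFrame({
    "id1": ["A", "A", "B", "B"],
    "id2": [1, 1, 1, 1],
    "id3": [100.0, np.nan, np.nan, np.nan],
})
id_cols = ["id1", "id2", "id3"]

# dropna=False keeps rows whose key contains NaN as ordinary groups,
# and ngroup() labels every row with the number of its group.
df["group_id"] = df.groupby(id_cols, dropna=False).ngroup()
print(df)
```

Here the two `B` rows share one group because their keys, including the NaN, are identical. Without `dropna=False`, rows with a NaN key are left out of the groups.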


2014-11-06 FooBar

There are surely countless solutions. Here is what I came up with...

>>> df 
    id1 id2 id3 
0 A 1 100 
1 A 1 101 
2 B 1 100 
3 B 1 100 

import pandas as pd

# get a DataFrame with just the unique "keys"
g = df.groupby([u'id1',u'id2',u'id3'])
gdf = pd.DataFrame(list(g.groups.keys()),columns=df.columns)

# an idea is to re-use the index as the 'group_id'
# the next three commands support that
# (DataFrame.sort was removed from pandas; sort_values is its replacement)
gdf.sort_values([u'id1',u'id2',u'id3'],inplace=True)
gdf.reset_index(drop=True,inplace=True)
gdf['group_id'] = gdf.index

# merge on the three id columns
mdf = df.merge(gdf,how='inner',on=df.columns.tolist())

This produces:

 
    id1 id2 id3 group_id 
0 A 1 100   0 
1 A 1 101   1 
2 B 1 100   2 
3 B 1 100   2
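
The key table plus merge above can probably be collapsed into a single step on newer pandas. This is a sketch rather than part of the original answer, assuming a pandas version that provides `GroupBy.ngroup()` (0.20.2 or later) and the same four-row frame as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["A", "A", "B", "B"],
    "id2": [1, 1, 1, 1],
    "id3": [100, 101, 100, 100],
})
id_cols = ["id1", "id2", "id3"]

# ngroup() numbers the groups in sorted key order (sort=True is the default)
# and returns that number for every row, so no separate merge is needed.
df["group_id"] = df.groupby(id_cols).ngroup()
print(df)
```

Since `id_cols` is an ordinary list, it can still be passed along as a parameter, which keeps the property the asker liked about this answer.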


2014-11-07 00:04:36 bernie

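I like this answer because I can pass in a list of columns, though I had hoped it would be more efficient. – FooBar 2014-11-07 00:35:53

You should replace the first line with the following: 'df2 = df.replace(np.NaN, -1)', 'g = df2.groupby(...)'. 'groupby' does not work well with 'NaN' (not "as expected") and will create a separate group for each 'NaN' value. – FooBar 2014-11-07 19:39:32

Thanks, @FooBar. I updated the answer. Would dropping the NaNs suit your purposes? – bernie 2014-11-07 20:01:37

Is this what you are looking for?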

這是你在找什麼？

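import pandas as pd

df = pd.DataFrame({'id1': ['A','A','B','B'],'id2':[1,1,1,1],'id3':[100,101,100,100]})

# build a string key per row; note that plain concatenation can collide,
# e.g. ('A', 11, 0) and ('A', 1, 10) both give 'A110'
def makegroup(x,y,z):
    return str(x) + str(y) + str(z)

df['groupid'] = df.apply(lambda row: makegroup(row['id1'], row['id2'], row['id3']), axis=1)

# number the unique keys 1, 2, 3, ... in order of first appearance
groupiddict = {}
groupincrementer = 1

for x in df['groupid'].unique():
    groupiddict[x] = groupincrementer
    groupincrementer += 1

# look up each row's integer id from its string key
df['groupidINT'] = df['groupid'].map(groupiddict)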

Here is the output:

id1 id2 id3 groupid groupidINT 
0 A 1 100 A1100   1 
1 A 1 101 A1101   2 
2 B 1 100 B1100   3 
3 B 1 100 B1100   3
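
The same numbering can likely be obtained without string keys or an explicit loop. The sketch below is an alternative, not the answer's own method: it builds one tuple per row and hands the tuples to `pd.factorize`, which assigns integer codes in order of first appearance. Using tuples also avoids the ambiguity of concatenated strings.

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["A", "A", "B", "B"],
    "id2": [1, 1, 1, 1],
    "id3": [100, 101, 100, 100],
})
id_cols = ["id1", "id2", "id3"]

# One tuple per row, e.g. ('A', 1, 100); tuples are hashable, so equal
# identifier combinations compare equal without string concatenation.
keys = pd.Series(list(zip(*(df[c] for c in id_cols))), index=df.index)

# factorize returns 0-based codes; add 1 to mirror the 1-based groupidINT.
codes, uniques = pd.factorize(keys)
df["groupidINT"] = codes + 1
print(df)
```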


2014-11-06 23:47:07
