2017-10-12 109 views

Python pandas: random sampling of rows

I have a sample DataFrame with a country column. The number of records for each country:

d1.groupby("country").size() 

country 
Australia  21 
Cambodia  58 
China   280 
India   133 
Indonesia  195 
Malaysia  138 
Myanmar   51 
Philippines  49 
Singapore  1268 
Taiwan   47 
Thailand  273 
Vietnam  288 

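How can I select, say, 100 random samples per country when a country has more than 100 samples? (If a country has 100 or fewer samples, nothing should be done.) Currently I do it like this, for Singapore as an example: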

import random

names_nonsg_ls = [] 
names_sg_ls = [] 

# if the country is not SG, add it to names_nonsg_ls. 
# else, add it to names_sg_ls, which will be subsampled later. 
for index, row in d0.iterrows(): 
    if str(row["country"]) != "Singapore": 
        names_nonsg_ls.append(str(row["header"])) 
    else: 
        names_sg_ls.append(str(row["header"])) 

# Select 100 random names from names_sg_ls 
names_sg_ls = random.sample(names_sg_ls, 100) 
# Form the list of names to retain 
names_ls = names_nonsg_ls + names_sg_ls 
# create new dataframe 
d1 = d0.loc[d0["header"].isin(names_ls)] 

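But manually building a new list for every country with more than 100 names is poor form, not to mention that I first have to pick out by hand the countries that have more than 100 names.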

Answer


You can group by country, then sample each group depending on its size:

d1.groupby("country", group_keys=False).apply(lambda g: g.sample(100) if len(g) > 100 else g) 

A small example with a limit of 3 rows per group:

import pandas as pd

df = pd.DataFrame({ 
    'A': ['a','a','a','a','b','b','b','c','d'], 
    'B': list(range(9)) 
}) 

df.groupby('A', group_keys=False).apply(lambda g: g.sample(3) if len(g) > 3 else g) 
# Example output (which 'a' rows are picked, and their order, varies between runs):
# A B 
#2 a 2 
#0 a 0 
#1 a 1 
#4 b 4 
#5 b 5 
#6 b 6 
#7 c 7 
#8 d 8
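
A possible variation on the same idea, as a minimal sketch: shuffle the whole frame once, then keep the first n rows of each group. Groups with n or fewer rows are kept whole automatically, so no explicit size check is needed. The helper name sample_up_to_n and its parameters are hypothetical (not from the original post). The random_state argument is added only to make runs repeatable, and the final sort_index() assumes the original index order is what you want to restore.

```python
import pandas as pd


def sample_up_to_n(df, by, n, random_state=None):
    """Keep at most n randomly chosen rows per group (hypothetical helper)."""
    # Shuffle every row; frac=1 returns all rows in random order.
    shuffled = df.sample(frac=1, random_state=random_state)
    # head(n) keeps the first n rows of each group in the shuffled order,
    # which amounts to a random sample without replacement per group.
    capped = shuffled.groupby(by).head(n)
    # Assumption: restoring the original index order is desired.
    return capped.sort_index()


df = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'],
    'B': list(range(9)),
})

result = sample_up_to_n(df, 'A', 3, random_state=0)
print(result.groupby('A').size())
```

Here group a is cut from 4 rows to 3, while b, c and d are returned unchanged. For the data in the question, the equivalent call would be sample_up_to_n(d1, "country", 100).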