Python pandas: randomly sampling rows
I have a sample DataFrame with a "country" column. The number of records for each country:
d1.groupby("country").size()
country
Australia 21
Cambodia 58
China 280
India 133
Indonesia 195
Malaysia 138
Myanmar 51
Philippines 49
Singapore 1268
Taiwan 47
Thailand 273
Vietnam 288
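How can I select, say, 100 random samples per country when a country has more than 100 samples? (If a country has <= 100 samples, do nothing.) Currently I do this for, say, Singapore: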
import random

names_nonsg_ls = []
names_sg_ls = []
# if the country is not SG, add it to names_nonsg_ls.
# else, add it to names_sg_ls, which will be subsampled later.
for index, row in d0.iterrows():
    if str(row["country"]) != "Singapore":
        names_nonsg_ls.append(str(row["header"]))
    else:
        names_sg_ls.append(str(row["header"]))
# Select 100 random names from names_sg_ls
names_sg_ls = random.sample(names_sg_ls, 100)
# Form the list of names to retain
names_ls = names_nonsg_ls + names_sg_ls
# create new dataframe
d1 = d0.loc[d0["header"].isin(names_ls)]
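But manually creating a new list for every country with more than 100 names is just poor form, not to mention that I first have to pick out by hand the countries that have more than 100 names.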
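
For reference, here is a rough sketch of the behaviour I am after, written for all countries at once. It shuffles the rows and then keeps at most the first 100 rows of each country, so countries with <= 100 rows are left untouched. The helper name cap_per_group, its parameters n and seed, and the toy DataFrame are all made up for illustration and only stand in for my real d0; this is probably not the only way, or the best way, to do it.

```python
import pandas as pd

def cap_per_group(df, col="country", n=100, seed=0):
    """Keep at most n randomly chosen rows per value of `col` (hypothetical helper)."""
    # Shuffle every row; random_state only makes the sketch reproducible.
    shuffled = df.sample(frac=1, random_state=seed)
    # head(n) keeps the first n rows of each group, or the whole group if it is smaller.
    capped = shuffled.groupby(col).head(n)
    # Put the surviving rows back in their original order.
    return capped.sort_index()

# Toy data standing in for d0 (assumed: only "country" and "header" columns).
d0 = pd.DataFrame({
    "country": ["Singapore"] * 250 + ["Taiwan"] * 47 + ["China"] * 120,
    "header": [f"name_{i}" for i in range(417)],
})

d1 = cap_per_group(d0, n=100)
print(d1.groupby("country").size())
```

With this, no per-country list is needed and the countries above the threshold do not have to be picked out by hand.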