Pandas MultiIndex: what am I doing wrong?


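I have a program with a large pandas DataFrame of pairwise interactions (one per row) that I random-walk through. At each successive step, the list of options is narrowed down from the entire DataFrame by specific values in two columns, so basically: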

df_options = df[(df.A == x) & (df.B == y)]

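I have things working with syntax like the above, but it seemed like a good idea in terms of speed (which is the limiting factor) to index the df by A and B, like so: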

df.sort(['A', 'B'], inplace=True) 
df.index = df.index.rename('idx') 
df = df.set_index(['A', 'B'], drop=False, append=True, verify_integrity=True)

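(Note that I keep the original index as 'idx', because that is how I record the random walks and access specific rows.)
I then replaced the original df_options code with, first,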
df.xs((x, y), level=('A', 'B'))
and, after having problems with that,
df.loc(axis=0)[:,A,B]

Also, where I need a specific value, the original syntax changed from

df_options.loc[new, 'sim']

to

df_options.xs(new, level='idx')['sim'].values[0]

or

df_options.loc(axis=0)[new,:,:]['sim'].values[0]

('new' is the randomly chosen next index of the df, and 'sim' is a column of pairwise similarity scores.)
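For reference, here is a minimal sketch of the two selection idioms described above. It reuses the three sample rows given further down, under the column names A, B, and sim. It assumes a recent pandas rather than 0.14.1, and it drops the A and B columns when moving them into the index (the default `drop=True`), which differs from the `drop=False` used in the question.

```python
import pandas as pd

# Three illustrative rows; 'idx' plays the role of the original row label.
df = pd.DataFrame(
    {
        "A": [8655, 13307, 22276],
        "B": ["010", "016", "002"],
        "sim": [0.8782, 0.9461, 0.8809],
    },
    index=pd.Index([174513, 270870, 478503], name="idx"),
)

# Append A and B to the existing index, then sort so that slicing works.
df = df.set_index(["A", "B"], append=True).sort_index()

# Option 1: cross-section on the A and B levels.
# The selected levels are dropped, so the result is indexed by 'idx' only.
df_options = df.xs((13307, "016"), level=["A", "B"])

# Option 2: .loc with an explicit slice over the 'idx' level.
# All three index levels are kept in the result.
same_options = df.loc[pd.IndexSlice[:, 13307, "016"], :]

# Because df_options is indexed by 'idx' alone, a plain label lookup works.
new = 270870
sim_value = df_options.loc[new, "sim"]
```

With the `xs` form, the follow-up lookup stays as simple as the original `df_options.loc[new, 'sim']`, since only the 'idx' level remains.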

As I hacked away trying to get this to work, I kept getting errors like '...not hashable' and AttributeError: 'Int64Index' object has no attribute 'get_loc_level'

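Which brings me to the question in the title: what am I doing wrong? More specifically:
1) Does a MultiIndex really have the potential to speed this process up the way I think it does?
2) If so, what are the correct idioms to use here (it feels like I am flailing with .xs and .loc)?
3) Or should I be using something else, like raw numpy?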

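EDIT: In the process of creating an example with code, I managed to get it working. I will say that I had to jump through some awkward hoops, such as df.index[rand_pair][0][0].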

In response to Jeff: pandas 0.14.1

df.info() 
<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 561567 entries, (0, 0, 003) to (561566, 26127, 011) 
Data columns (total 14 columns): 
p1    561567 non-null int64 
smp1   561567 non-null object 
rt1    561567 non-null float64 
cas1   561567 non-null object 
sim1   561567 non-null float64 
p2    561567 non-null int64 
smp2   561567 non-null object 
rt2    561567 non-null float64 
cas2   561567 non-null object 
sim2   561567 non-null float64 
nlsim1   561567 non-null float64 
sum_spec_sq1 561567 non-null float64 
sum_spec_sq2 561567 non-null float64 
sum_s1s2  561567 non-null float64 
dtypes: float64(8), int64(2), object(4)

Note: 'p1', 'smp2', and 'nlsim1' correspond to 'A', 'B', and 'sim' in my question above. Here is enough data to walk a few steps:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame({u'nlsim1': {174513: 0.8782, 270870: 0.9461, 478503: 0.8809}, 
u'p1': {174513: 8655, 270870: 13307, 478503: 22276}, 
u'p2': {174513: 13307, 270870: 22276, 478503: 2391}, 
u'smp1': {174513: u'007', 270870: u'010', 478503: u'016'}, 
u'smp2': {174513: u'010', 270870: u'016', 478503: u'002'}}) 
df.index = df.index.rename('idx') 
df = df.set_index(['p1', 'smp2'], drop=False, append=True, verify_integrity=True) 

def weighted_random_choice(): 
    options = df_options.index.tolist() 
    tot = df_options.nlsim1.sum() 
    options_weight = df_options.nlsim1/tot 
    return np.random.choice(options, p=list(options_weight))

Initiate the walk:

samples = set([c for a, b, c in df.index.values]) 
df_numbered = range(df.shape[0]) 
#rand_pair = random.sample(df_numbered, 1) 
rand_pair = [0] 
path = [df.index[rand_pair][0][0]]

Walk (iterate this):

row = df.loc[path[-1],:,:] 
p = row.p2.values[0] 
smp = row.smp2.values[0] 
print p, smp 
samples.discard(smp) 
print sorted(list(samples)) 
pick_sample = random.sample(samples, 1)[0] 
print pick_sample 
df_options = df.xs((p, pick_sample), level=('p1', 'smp2')) 
if df_options.shape[0] < 1: 
    print "out of options, stop iterating" 
    print "path=", path 
else: 
    print "# options: ", df_options.shape[0] 
    new = weighted_random_choice() 
    path.append(new) 
    print path 
    print "you should keep going"

Output, first step:

13307 010 
[u'002', u'016'] 
016 
# options: 1 
[174513, 270870] 
you should keep going

Second step:

22276 016 
[u'002'] 
002 
# options: 1 
[174513, 270870, 478503] 
you should keep going

As expected, the third step errors out because it has run out of samples.
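The weighted pick used in the walk above reads the global df_options. A variant that takes the options frame as an argument might look like the sketch below. The parameter names `options`, `weight_col`, and `rng` are hypothetical, and it uses NumPy's `Generator` API and Python 3 syntax rather than the Python 2 style of the code above.

```python
import numpy as np
import pandas as pd


def weighted_random_choice(options, weight_col="nlsim1", rng=None):
    """Pick one index label from `options`, weighted by `weight_col`."""
    rng = np.random.default_rng() if rng is None else rng
    weights = options[weight_col].to_numpy(dtype=float)
    # Normalise so the probabilities sum to 1, as Generator.choice requires.
    probs = weights / weights.sum()
    return rng.choice(options.index.to_numpy(), p=probs)


# Illustrative frame indexed by 'idx' only, as df.xs(...) would return it.
options = pd.DataFrame(
    {"nlsim1": [0.2, 0.3, 0.5]},
    index=pd.Index([11, 22, 33], name="idx"),
)
pick = weighted_random_choice(options, rng=np.random.default_rng(0))
```

Passing the frame and the random generator explicitly makes the step easier to test, and seeding the generator makes a walk reproducible.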


2014-10-06 Nathan Lloyd

Well, you should start by showing a complete copy/pastable example of your input and output, df.info() on your real dataset, and your pandas version. – Jeff 2014-10-06 17:11:10

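@Jeff I am working on runnable code for you. My guess was that I was doing one thing totally and obviously wrong, but your request for details gives me hope... – 2014-10-06 21:16:24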

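OK, now it just needs a sample input frame and your expected output (it should be copy/pastable and runnable; just generate random data, but the index should be consistent with what you are doing). df.info() gives an idea of your real data. – Jeff 2014-10-06 22:51:42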

Well, the simple fix was to use two copies of the DataFrame: the original, and another indexed by 'A' and 'B':

dfi = df.set_index(['A', 'B'])

By changing the 'select a specific A, B' paradigm from

df_options = df[(df.A == x) & (df.B == y)]

to

df_options = dfi.loc(axis=0)[x, y]

I was able to get a 5x speedup. It should also scale better as df grows.
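As a rough illustration of the two-copy approach, the sketch below builds a random frame and checks that the boolean mask and the indexed lookup select the same rows. The data, the column names, and the extra `sort_index()` call are assumptions for the example, and it targets a recent pandas. It does not measure speed, which will depend on the data and the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical random data standing in for the real pairwise table.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame(
    {
        "A": rng.integers(0, 20, size=n),
        "B": rng.choice(["001", "002", "003"], size=n),
        "sim": rng.random(n),
    }
)
df.index = df.index.rename("idx")

# Second copy indexed by (A, B); the original row label survives as a column.
# Sorting the index lets pandas locate a key without scanning every row.
dfi = df.reset_index().set_index(["A", "B"]).sort_index()

# Take a pair that is known to exist.
x, y = df["A"].iloc[0], df["B"].iloc[0]

# Original approach: boolean mask over the whole frame.
mask_result = df[(df.A == x) & (df.B == y)]

# Indexed approach: a list of tuples always returns a DataFrame,
# even when only one row matches.
index_result = dfi.loc[[(x, y)]]

# Both approaches select the same original rows.
same_rows = set(mask_result.index) == set(index_result["idx"])
```

Keeping 'idx' as a column in the indexed copy means the rows found there can still be traced back to the original frame.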


2014-10-17 23:35:55
