Why is DataFrame.loc[[1]] 1,800x slower than df.ix[[1]] and 3,500x slower than df.loc[1]?

Try this yourself:

import pandas as pd 
s=pd.Series(xrange(5000000)) 
%timeit s.loc[[0]] # You need pandas 0.15.1 or newer for it to be that slow 
1 loops, best of 3: 445 ms per loop
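For readers without IPython or Python 2, the same comparison can be sketched with the standard timeit module. This is only a rough reproduction under assumptions: the row count n is smaller than in the snippet above so that it finishes quickly, and on pandas versions without this bug the two timings may be much closer than the figures quoted in this question.

```python
import timeit

import pandas as pd

# Assumption: a smaller Series than the 5,000,000 rows used above,
# chosen only to keep the run short. Increase n to widen any gap.
n = 200000
s = pd.Series(range(n))

repeats = 100
# Scalar label lookup: returns a single value.
scalar_time = timeit.timeit(lambda: s.loc[0], number=repeats) / repeats
# Single-element list lookup: returns a one-element Series.
list_time = timeit.timeit(lambda: s.loc[[0]], number=repeats) / repeats

print("s.loc[0]:   %.1f microseconds per call" % (scalar_time * 1e6))
print("s.loc[[0]]: %.1f microseconds per call" % (list_time * 1e6))
```

The absolute numbers depend on the machine and the pandas version, so no expected output is given here.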

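Update: this is a legitimate bug in pandas that was probably introduced in 0.15.1, as of around August 2014. Workarounds: wait for a new release while using an older version of pandas; get a cutting-edge development version from GitHub; manually make a one-line modification in your release of pandas; or temporarily use .ix instead of .loc.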

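I have a DataFrame with 4.8 million rows, and selecting a single row with .loc[[id]] (a single-element list) takes 489 ms, almost half a second. That is 1,800 times slower than the identical .ix[[id]] and 3,500 times slower than .loc[id] (passing the id as a value, not as a list). To be fair, .loc[list] takes about the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is more than a thousand times faster and produces an identical result. It was my understanding that .ix was supposed to be slower, wasn't it?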

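I am using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somehow more general, and presumably slower, than .loc and .iloc. Specifically, it says: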

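However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.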
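To make the label versus position distinction in that quote concrete, here is a small sketch. The three-row DataFrame and its column name are made up for illustration, and it uses only .loc and .iloc because .ix is not available in recent pandas releases.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions 0, 1, 2.
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = df.loc[10, "value"]   # label 10 refers to the first row
by_position = df.iloc[0, 0]      # position 0 is also the first row
print(by_label, by_position)

# Position 10 does not exist in a three-row frame, so .iloc rejects it
# even though 10 is a valid label.
try:
    df.iloc[10]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
print(out_of_bounds)
```

This is the same situation as the benchmark in this question, where the id is a valid label but larger than the number of rows.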

Here is an IPython session with the benchmarks:

print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
print 'df.index begins with ', df.index[:20]
print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())

# First extract one element directly. Expected result, no issues here.
id=5965356
print 'Extract one element with id %d' % id
%timeit df.loc[id]
%timeit df.ix[id]
print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result

# Now extract this one element as a list.
%timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
%timeit df.ix[[id]]
print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]])) # this one should be True
# Let's double-check that in this case .ix is the same as .loc, not .iloc,
# as this would explain the difference.
try:
    print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
except:
    print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))

# Finally, for the sake of completeness, let's take a look at iloc
%timeit df.iloc[3456789] # this is still 100+ times faster than the next version
%timeit df.iloc[[3456789]]

Output:

The dataframe has 4826616 entries, indexed by integers that are less than 6177817 
df.index begins with Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64') 
The index is sorted: True 
Extract one element with id 5965356 
10000 loops, best of 3: 139 µs per loop 
10000 loops, best of 3: 141 µs per loop 
True 
1 loops, best of 3: 489 ms per loop 
1000 loops, best of 3: 270 µs per loop 
True 
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows 
10000 loops, best of 3: 98.9 µs per loop 
100 loops, best of 3: 12 ms per loop


2014-12-22 osa

Note that using '[[id]]' and '[id]' is not equivalent. '[id]' will return a Series, but '[[id]]' will return a one-row DataFrame. – BrenBarn
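The difference described in this comment can be checked with a short sketch. The tiny DataFrame below is invented for the example, and .loc stands in for .ix, which recent pandas releases no longer provide.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 2, 3])

row_as_series = df.loc[2]    # scalar label: the row comes back as a Series
row_as_frame = df.loc[[2]]   # list of labels: a one-row DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(row_as_frame.shape)
```

Building a DataFrame takes a little more work than building a Series, which fits a modest slowdown but not one of three orders of magnitude.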

@BrenBarn, yes, this explains the difference for '.ix': 141 µs vs. 270 µs. But why is '.loc[[id]]' so slow? – osa

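It looks like the issue was not present in pandas 0.14. I profiled it with line_profiler and I think I know what happened. Since pandas 0.15.1, a KeyError is raised if a given index is not present. It looks like when you use the .loc[list] syntax, pandas does an exhaustive search for the index along the entire axis, even after it has been found. That is, first, there is no early termination when an element is found, and second, the search in this case is brute-force.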

File: .../anaconda/lib/python2.7/site-packages/pandas/core/indexing.py

  1278                                         # require at least 1 element in the index
  1279       1        241    241.0     0.1     idx = _ensure_index(key)
  1280       1     391040 391040.0    99.9     if len(idx) and not idx.isin(ax).any():
  1281
  1282                                             raise KeyError("None of [%s] are in the [%s]" %
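As a sketch of the difference between the two styles of check, the code below compares the isin-based test from the profiled line with a lookup through Index.get_indexer. The function names, the index contents and the choice of get_indexer are my own assumptions for illustration; this is not necessarily the change pandas made to fix the bug, and get_indexer requires the axis index to be unique.

```python
import pandas as pd


def any_key_present_isin(key, axis_index):
    """Same shape as the profiled line: key.isin(axis).any()."""
    return bool(key.isin(axis_index).any())


def any_key_present_indexer(key, axis_index):
    """Alternative: get_indexer returns -1 for labels missing from axis_index."""
    return bool((axis_index.get_indexer(key) != -1).any())


# Hypothetical axis: unique integer labels 1..100000.
axis_index = pd.Index(list(range(1, 100001)))
present = pd.Index([5])
missing = pd.Index([0])

print(any_key_present_isin(present, axis_index),
      any_key_present_indexer(present, axis_index))
print(any_key_present_isin(missing, axis_index),
      any_key_present_indexer(missing, axis_index))
```

Both functions give the same answer. The difference is in how much of the axis each has to process per call, which is where the 99.9% of the time in the profile above was spent.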


2014-12-22 06:08:59 osa

Pandas indexing is crazy slow, so I switched to NumPy indexing:

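import numpy as np
import pandas as pd

df = pd.DataFrame(some_content)  # some_content: placeholder for your own data

# Takes forever: every .iloc call goes through pandas' indexing machinery.
for iPer in np.arange(-df.shape[0], 0, 1):
    x = df.iloc[iPer, :].values
    y = df.iloc[-1, :].values

# Fast: convert to a NumPy matrix once, then index it directly.
vals = np.matrix(df.values)
for iPer in np.arange(-vals.shape[0], 0, 1):
    x = vals[iPer, :]
    y = vals[-1, :]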


2017-09-11 22:57:48 citynorman
