Why is DataFrame.loc[[1]] 1,800x slower than df.ix[[1]] and 3,500x slower than df.loc[1]?
Try this yourself:
import pandas as pd
s=pd.Series(xrange(5000000))
%timeit s.loc[[0]] # You need pandas 0.15.1 or newer for it to be that slow
1 loops, best of 3: 445 ms per loop
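Update: this is a legitimate bug in pandas, probably introduced in 0.15.1, around August 2014. Workarounds: wait for a new release while using an older version of pandas; get a cutting-edge development version from GitHub; manually make a one-line modification in your installed version of pandas; or temporarily use .ix instead of .loc.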
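I have a dataframe with 4.8 million rows. Selecting a single row with .loc[[id]] (a list with a single element) takes 489 ms, almost half a second. That is 1,800 times slower than the equivalent .ix[[id]] and 3,500 times slower than .loc[id] (passing id as a value, not as a list). To be fair, .loc[list] takes roughly the same time regardless of the length of the list, but I do not want to spend 489 ms on it, especially when .ix is over a thousand times faster and produces an identical result. It was my understanding that .ix was supposed to be the slower one, wasn't it?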
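For anyone who wants to compare the scalar and list forms on their own machine, here is a minimal sketch. It assumes a recent pandas release, where .ix is no longer available, so it only compares .loc[id] with .loc[[id]]. The frame, the column name "value", and the names n and row_id are hypothetical stand-ins for the real data, and the timings it prints will not reproduce the 0.15.1 numbers above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: a sorted integer index with gaps.
n = 100_000
df = pd.DataFrame({"value": np.arange(n)}, index=np.arange(1, 2 * n, 2))

row_id = df.index[n // 2]  # a label that is known to exist in the index

as_series = df.loc[row_id]    # scalar label: the row comes back as a Series
as_frame = df.loc[[row_id]]   # one-element list: a one-row DataFrame

# Average time per call over 100 calls for each form.
t_scalar = timeit.timeit(lambda: df.loc[row_id], number=100) / 100
t_list = timeit.timeit(lambda: df.loc[[row_id]], number=100) / 100
print("scalar: %.1f us per call, list: %.1f us per call"
      % (t_scalar * 1e6, t_list * 1e6))
```

The two forms do different work (one builds a Series, the other a DataFrame), so some gap is expected; the question is about a gap of three orders of magnitude.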
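I am using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somehow more general, and presumably slower, than .loc and .iloc. Specifically, it says:
However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.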
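To make the label versus position distinction in that quote concrete, here is a small sketch on a toy frame whose integer labels do not match the row positions. The frame and all variable names are made up for illustration; it only uses .loc, .iloc and Index.get_loc.

```python
import pandas as pd

# Hypothetical toy frame: integer labels that differ from row positions.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[5, 7, 9])

by_label = df.loc[7]                  # label 7 is the second row
by_position = df.iloc[1]              # position 1 is the same second row
position_of_7 = df.index.get_loc(7)   # translate a label into a position

try:
    df.iloc[7]                        # only three rows, so position 7 is invalid
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(by_label["value"], by_position["value"], position_of_7, out_of_bounds)
```

This is the same situation as in the benchmark: an id that is a valid label for .loc can be out of range as a position for .iloc.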
Here is an IPython session with benchmarks:
print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
print 'df.index begins with ', df.index[:20]
print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())
# First extract one element directly. Expected result, no issues here.
id=5965356
print 'Extract one element with id %d' % id
%timeit df.loc[id]
%timeit df.ix[id]
print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result
# Now extract this one element as a list.
%timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
%timeit df.ix[[id]]
print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]])) # this one should be True
# Let's double-check that in this case .ix is the same as .loc, not .iloc,
# as this would explain the difference.
try:
    print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
except:
    print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))
# Finally, for the sake of completeness, let's take a look at iloc
%timeit df.iloc[3456789] # this is still 100+ times faster than the next version
%timeit df.iloc[[3456789]]
Output:
The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop
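Note that using '[[id]]' and '[id]' is not equivalent. '[id]' will return a Series, but '[[id]]' will return a one-row DataFrame. – BrenBarn
@BrenBarn, yes, this explains the difference for '.ix': 141 µs vs 270 µs. But why is '.loc[[id]]' so slow? – osa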