2016-06-24 20 views
1

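DataFrame created with read_csv() gives unexpected query() results

I am using DataFrame.query() to find rows, and I have run into a problem that I can only reproduce when the data is loaded from a CSV file. If I build what I believe is an identical DataFrame in pure Python, query() works as expected.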

Here is the CSV data:

,ASK_PRICE,ASK_QTY,BID_PRICE,BID_QTY 
2016-06-17 16:38:00.043,104.258,50.0,104.253,100.0 
2016-06-17 16:38:00.043,104.259,100.0,104.253,100.0 
2016-06-17 16:38:02.978,104.259,100.0,104.254,50.0 
2016-06-17 16:38:03.999,104.259,100.0,104.253,50.0 
2016-06-17 16:38:03.999,104.259,100.0,104.251,150.0 
2016-06-17 16:38:04.001,104.259,100.0,104.251,100.0 

And here is a sample script that demonstrates the problem:

#!/usr/bin/env python 
import pandas as pd 
import numpy as np 
from datetime import datetime 

timestamp = [ 
     datetime.strptime('2016-06-17 16:38:00.043', '%Y-%m-%d %H:%M:%S.%f'), 
     datetime.strptime('2016-06-17 16:38:00.043', '%Y-%m-%d %H:%M:%S.%f'), 
     datetime.strptime('2016-06-17 16:38:02.978', '%Y-%m-%d %H:%M:%S.%f'), 
     datetime.strptime('2016-06-17 16:38:03.999', '%Y-%m-%d %H:%M:%S.%f'), 
     datetime.strptime('2016-06-17 16:38:03.999', '%Y-%m-%d %H:%M:%S.%f'), 
     datetime.strptime('2016-06-17 16:38:04.001', '%Y-%m-%d %H:%M:%S.%f') 
     ] 
bid_price = [ 104.253, 104.253, 104.254, 104.253, 104.251, 104.251 ] 
bid_qty = [ 100.0, 100.0, 50.0, 50.0, 150.0, 100.0 ] 
ask_price = [ 104.258, 104.259, 104.259, 104.259, 104.259, 104.259 ] 
ask_qty = [ 50.0, 100.0, 100.0, 100.0, 100.0, 100.0 ] 

df1 = pd.DataFrame(index=timestamp, data={'BID_PRICE': bid_price, 
    'BID_QTY': bid_qty, 'ASK_PRICE': ask_price, 'ASK_QTY': ask_qty}) 

df2 = pd.read_csv('in.csv', index_col=0, skip_blank_lines=True) 
df2.index = pd.to_datetime(df2.index) 

# Show both frames, their indexes and their columns side by side.
print(df1)
print(df2)
print()
print(df1.index)
print(df2.index)
print()
print(df1.columns)
print(df2.columns)
print()
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

print(df1)
print(df2)
print()

# The same query against both frames: df1 returns a row, df2 comes back empty.
df1m = df1.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)').tail(1)
df2m = df2.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)').tail(1)
print(df1m)
print(df2m)

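The query on the DataFrame created from the CSV fails. As far as I can tell, the data, the index and the column types are the same, so what is the difference between these two DataFrames?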

+0

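What do the DataFrames look like in a debugger? Printing a DataFrame may not reveal the difference, because the object may have a __str__ that formats the data in a way that masks the problem.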
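
One way to follow up on this suggestion is to look at the stored values at full precision instead of at the printed frame. The sketch below is illustrative only: it builds a value one float step above 104.254 with np.nextafter as a stand-in for a slightly-off parsed value (an assumption, not the asker's actual data) and shows that the default display hides the difference while repr() exposes it.

```python
import numpy as np
import pandas as pd

exact = 104.254
# Hypothetical stand-in for a slightly-off parsed value:
# the next representable float above 104.254.
off_by_one_step = np.nextafter(exact, np.inf)

df = pd.DataFrame({'BID_PRICE': [exact, off_by_one_step]})

# The default display rounds to a few decimal places,
# so both rows look identical when printed.
print(df)

# repr() of a Python float gives the shortest string that round-trips,
# which makes the difference visible.
full_precision = [repr(v) for v in df['BID_PRICE'].tolist()]
print(full_precision)

# An exact comparison only matches the first row.
matches = (df['BID_PRICE'] == exact).tolist()
print(matches)
```

Running the same repr() check on df2['BID_PRICE'] from the question would show whether the parsed prices really equal the literals used in the query.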

Answers

2

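This is the well-known problem of comparing floating-point values for exact equality. The value parsed from the CSV may differ from the literal 104.254 in its last bits, in which case == does not match even though both print as 104.254.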

Try this:

In [70]: df2.query('(abs(BID_PRICE - 104.254) < 0.000001) and (BID_QTY >= 50)') 
Out[70]: 
                         ASK_PRICE  ASK_QTY  BID_PRICE  BID_QTY
2016-06-17 16:38:02.978    104.259    100.0    104.254     50.0

instead of:

In [72]: df2.query('(BID_PRICE == 104.254) and (BID_QTY >= 50)') 
Out[72]: 
Empty DataFrame 
Columns: [ASK_PRICE, ASK_QTY, BID_PRICE, BID_QTY] 
Index: [] 
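
The same tolerance test can also be written as a boolean mask with numpy.isclose, which may read more clearly than spelling out abs(...) inside the query string. This is a sketch on a small hand-built frame, not the asker's data: the atol of 1e-6 mirrors the tolerance used above, and rtol=0 is an assumption made here to keep the test purely absolute.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the second price is deliberately one float step away
# from 104.254 to imitate a value that does not compare equal exactly.
df = pd.DataFrame({
    'BID_PRICE': [104.253, np.nextafter(104.254, np.inf), 104.251],
    'BID_QTY': [100.0, 50.0, 150.0],
})

# Absolute tolerance of 1e-6, no relative tolerance.
price_matches = np.isclose(df['BID_PRICE'], 104.254, rtol=0, atol=1e-6)
result = df[price_matches & (df['BID_QTY'] >= 50)]
print(result)
```

The tolerance should be chosen well below the tick size of the prices (0.001 in this data) so that neighbouring price levels are never confused.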

A simple example:

In [73]: 2.2 * 3.0 == 6.6 
Out[73]: False 

In [74]: 3.3 * 2.0 == 6.6 
Out[74]: True 
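
If the mismatch comes from how the CSV text is converted to floats, another option is to ask read_csv for its round-trip float parser through the float_precision parameter, so that parsed values should match what Python produces for the same literal. The sketch below uses an in-memory copy of two rows from the question (io.StringIO stands in for in.csv). Whether the default parser shows the mismatch depends on the pandas version, so only the round-trip result is checked here.

```python
import io
import pandas as pd

# In-memory stand-in for in.csv, using two rows from the question.
csv_text = (
    ",ASK_PRICE,ASK_QTY,BID_PRICE,BID_QTY\n"
    "2016-06-17 16:38:00.043,104.258,50.0,104.253,100.0\n"
    "2016-06-17 16:38:02.978,104.259,100.0,104.254,50.0\n"
)

df = pd.read_csv(io.StringIO(csv_text), index_col=0,
                 float_precision='round_trip')
df.index = pd.to_datetime(df.index)

# With the round-trip parser the exact comparison finds the row.
exact_hits = df[(df['BID_PRICE'] == 104.254) & (df['BID_QTY'] >= 50)]
print(exact_hits)
```

Even with round-trip parsing, a tolerance-based comparison remains the safer choice as soon as the prices go through any arithmetic.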

0

I don't know the answer, but it seems to be related to the index column. I ran a simplified version of the code and it worked as expected.

#!/usr/bin/env python 

import pandas as pd 

timestamp = [1, 2, 3, 4, 5, 6] 
bid_price = [104, 105, 106, 107, 107, 107] 
bid_qty = [100.0, 100.0, 50.0, 50.0, 150.0, 100.0] 

df1 = pd.DataFrame(index=timestamp, 
        data={'BID_PRICE': bid_price, 'BID_QTY': bid_qty}) 

df2 = pd.read_csv('in.csv', index_col=0, skip_blank_lines=True) 

print(df1) 
print(df2) 

df1m = df1.query('(BID_PRICE == 107) and (BID_QTY >= 50)').tail(1) 
df2m = df2.query('(BID_PRICE == 107) and (BID_QTY >= 50)').tail(1) 

print("Result 1: {}".format(df1m)) 
print("Result 2: {}".format(df2m)) 

---------------- in.csv file contents -----------

Index,BID_PRICE,BID_QTY 
1, 104, 100.0 
2, 105, 100.0 
3, 106, 50.0 
4, 107, 50.0 
5, 107, 150.0 
6, 107, 100.0
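
A likely reason this simplified version works is that its prices are whole numbers, which are stored exactly, so no rounding can creep in while parsing. That is an inference from the data shown, not something the answer states. A small check in plain Python, using Decimal to reveal the exact stored value of a float:

```python
from decimal import Decimal

# Decimal(float) shows the exact binary value stored for a float.
stored_107 = Decimal(107.0)
stored_104_254 = Decimal(104.254)

# A whole number such as 107 is stored exactly...
whole_is_exact = stored_107 == Decimal('107')
print(whole_is_exact)

# ...whereas 104.254 is only approximated, so the stored value differs
# from the decimal literal.
fraction_is_exact = stored_104_254 == Decimal('104.254')
print(fraction_is_exact)
```

This suggests the difference between the two scripts lies in the fractional prices and not in the index column.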