2014-11-05 388 views
0

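Pandas index skipping values

I am reading two CSV files, selecting data from specific columns, dropping NA/null values, and then using the rows in one file that meet some condition to print the associated data from the other file: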

import pandas

data1 = pandas.read_csv(filename1, usecols=['X', 'Y', 'Z']).dropna()
data2 = pandas.read_csv(filename2, usecols=['X', 'Y', 'Z']).dropna()
i = 0  # counts the rows of data1 visited so far
for item in data1['Y']:
    if item > -20:
        print(data2['X'][i])
    i += 1

However, this raises an error:

File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7035) 
File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6976) 
KeyError: 6L 

It turns out that when I print data2['X'], the index is missing the numbers of the rows that were dropped:

0 -1.953779 
1 -2.010039 
2 -2.562191 
3 -2.723993 
4 -2.302720 
5 -2.356181 
7 -1.928778 
... 

How can I fix this and renumber the index values? Or is there a better way?

Answers

1

Found the solution in another question, here: Reindexing dataframes

.reset_index(drop=True) did the trick!

0 -1.953779 
1 -2.010039 
2 -2.562191 
3 -2.723993 
4 -2.302720 
5 -2.356181 
6 -1.928778 
7 -1.925359 
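
As a minimal sketch of why the gap appears and what the reset does (the small DataFrame here is made-up data standing in for one of the CSV files, not the asker's data): dropna() removes rows but keeps the surviving rows' original labels, and reset_index(drop=True) renumbers them from 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for one of the CSV files.
df = pd.DataFrame({"X": [1.0, 2.0, np.nan, 4.0], "Y": [5.0, 6.0, 7.0, 8.0]})

# dropna() removes the row with the NaN but keeps the original labels.
dropped = df.dropna()
print(list(dropped.index))

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels.
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))

# .iloc selects by position, so it works whether or not the labels have gaps.
third_value = dropped["X"].iloc[2]
print(third_value)
```

If only positional access is needed, .iloc avoids the reset altogether; resetting is the simpler choice when later code indexes with plain integers.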
1

Are your two files/dataframes the same length? If so, you can use a boolean mask to do this (which avoids the for loop):

data2['X'][data1['Y'] > -20] 
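
A self-contained sketch of the same idea, assuming two frames of equal length that share the same index (the values below are invented for illustration). It uses .loc with the mask and the column name in one step, which is equivalent here to the chained form above:

```python
import pandas as pd

# Hypothetical frames standing in for data1 and data2 (same length, same index).
data1 = pd.DataFrame({"X": [0, 1, 2, 3], "Y": [-30, -10, -25, 5]})
data2 = pd.DataFrame({"X": [10, 11, 12, 13], "Y": [0, 0, 0, 0]})

# Boolean Series: True where the condition on data1 holds.
mask = data1["Y"] > -20

# Select the rows of data2 where the mask is True, and only column X.
selected = data2.loc[mask, "X"]
print(selected.tolist())
```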

Edit: in response to the comment

Here is what happens at each step:

In [16]: df1 
Out[16]: 
    X Y 
0 0 0 
1 1 2 
2 2 4 
3 3 6 
4 4 8 

In [17]: df2 
Out[17]: 
    Y X 
0 64 75 
1 65 73 
2 36 44 
3 13 58 
4 92 54 

# creates a pandas Series object of True/False, which you can then use as a "mask" 
In [18]: df2['Y'] > 50 
Out[18]: 
0  True 
1  True 
2 False 
3 False 
4  True 
Name: Y, dtype: bool 

# mask is applied element-wise to (in this case) the column of your DataFrame you want to filter 
In [19]: df1['X'][ df2['Y'] > 50 ] 
Out[19]: 
0 0 
1 1 
4 4 
Name: X, dtype: int64 

# same as doing this (where the mask is applied to the whole dataframe, and then you grab your column)
In [20]: df1[ df2['Y'] > 50 ]['X'] 
Out[20]: 
0 0 
1 1 
4 4 
Name: X, dtype: int64 
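
One detail worth keeping in mind with the session above: a boolean Series used as a mask is matched to the data by index label, not by position. The following sketch uses invented values to show the difference; converting the mask with to_numpy() is one way to get purely positional behaviour instead.

```python
import pandas as pd

values = pd.Series([10, 20, 30], index=[0, 1, 2])

# Same labels as `values`, but listed in a different order.
mask = pd.Series([True, False, False], index=[2, 1, 0])

# A Series mask is aligned on labels: True belongs to label 2.
by_label = values[mask]
print(by_label.tolist())

# A plain NumPy array has no labels, so it is applied by position.
by_position = values[mask.to_numpy()]
print(by_position.tolist())
```

This is why two frames whose indexes differ after dropna() may need the same index (for example, both reset) before a mask built from one is applied to the other.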
+0

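So it returns all the values in data2['X'] that have the same index as the values in data1['Y'] that are greater than -20? Definitely cleaner than my loop approach. Thanks for sharing, it is always good to learn new/different approaches – stoves 2014-11-06 18:25:28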

+0

@stoves see the edit – 2014-11-06 18:52:09