2014-05-10 27 views
0

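Pandas: get the first row in a DataFrame matching a condition

I am using pandas to process a large amount of data. I want to find the fastest way to get the first row of a DataFrame by ID.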

I have two DataFrames:

school_detail 
school_id detail1 detail2 
1   d11  d21 
2   d12  d22 
2   d13  d23 
4   d14  d24 
... 
It has more than 20 million rows 

schools 
id school_name 
1 name1 
2 name2 
3 name3 
4 name4 
... 
It has 3 million rows 

I need to go through every row of school_detail and set a type for each row, in the fastest way possible.

def get_type(s_detail): 
    # I need to get school name here to calculate the type so I use 
    school = schools[schools.id == s_detail.school_id] # To get school by id 

school_detail['type'] = school_detail.apply(lambda x: get_type(x), axis=1) 

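I used %prun to time the lookup that gets the school by ID. It takes about 0.03 seconds per call.

When I run this on 10,000 rows of school_detail, it takes 43 seconds.

If I run it on 20 million rows, it may take several hours.

My question:

I want to find a better way to get the school by ID so that it runs faster.

The id column is unique. Does pandas use binary search on this column?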
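
On the binary-search question: as far as I know, a boolean mask such as `schools[schools.id == x]` compares every row on each call, whether or not the values are unique. Below is a minimal sketch, assuming it is acceptable to move `id` into the index, of a per-key lookup that goes through the index instead. The small frame and the helper name `lookup_name` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the schools table from the question.
schools = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "school_name": ["name1", "name2", "name3", "name4"],
})

# Move id into the index once; later lookups go through the index
# instead of building a boolean mask over all rows.
schools_by_id = schools.set_index("id")


def lookup_name(school_id):
    # Hypothetical helper: label-based lookup of a single school name.
    return schools_by_id.at[school_id, "school_name"]


name = lookup_name(2)
```

Even with an indexed lookup, calling a Python function once per row over 20 million rows is likely to stay slow, so this mainly illustrates where the per-call cost comes from.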

+0

You could try the [groupby](http://pandas.pydata.org/pandas-docs/stable/groupby.html) function instead of testing `[schools.id == s_detail.school_id]` –

+0

Your get_type function is confusing because it does not return anything. Are you trying to get a DataFrame with school_id, name, detail1, detail2? – cwharland

+0

It seems like all you want is a join – cwharland
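
The join suggested here could look roughly like the sketch below. It rebuilds the sample rows from the question as small in-memory frames (an assumption made only so the snippet is self-contained) and uses `DataFrame.merge` with a left join on `school_id` and `id`.

```python
import pandas as pd

# Sample rows copied from the question.
school_detail = pd.DataFrame({
    "school_id": [1, 2, 2, 4],
    "detail1": ["d11", "d12", "d13", "d14"],
    "detail2": ["d21", "d22", "d23", "d24"],
})
schools = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "school_name": ["name1", "name2", "name3", "name4"],
})

# A left join keeps every school_detail row, in its original order.
merged = school_detail.merge(
    schools, how="left", left_on="school_id", right_on="id"
)

# The key now appears twice (school_id and id); drop the duplicate.
merged = merged.drop(columns="id")
```

Once `school_name` sits next to the detail columns, the type can be computed from whole columns at once rather than inside a per-row function.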

Answers

0

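Here is an example of how to do this. It should be fast on large datasets because it does not use any loops or custom functions. It uses the pandas loc indexer.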

import pandas as pd 
from io import StringIO

data_school_detail = \ 
"""school_id,detail1,detail2 
1,d11,d21 
2,d12,d22 
2,d13,d23 
4,d14,d24""" 

data_schools = \ 
"""id,school_name 
1,name1 
2,name2 
3,name3 
4,name4""" 

# Creation of the dataframes 
school_detail = pd.read_csv(StringIO(data_school_detail),sep = ',') 
schools  = pd.read_csv(StringIO(data_schools),sep = ',', index_col = 0) 
# Create a dataframe containing the schools data to be applied on 
# dataframe school_detail 
res = schools.loc[school_detail['school_id']] 
# Reset index with school_detail index 
res.index = school_detail.index 
# Rename column as presented in the question 
res.columns = ['type'] 
# Add the columns to dataframe school_detail 
school_detail = school_detail.join(res) 

school_detail now contains

   school_id detail1 detail2   type
0          1     d11     d21  name1
1          2     d12     d22  name2
2          2     d13     d23  name2
3          4     d14     d24  name4
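
One caveat with a `loc` lookup by a list of labels: as far as I know, recent pandas versions raise a `KeyError` if any `school_id` is missing from the `schools` index. Below is a hedged alternative sketch using `Series.map`, which leaves `NaN` for unmatched IDs instead. The `school_id` value 5 is made up to show the missing case and is not in the original data.

```python
import pandas as pd

school_detail = pd.DataFrame({
    "school_id": [1, 2, 2, 5],  # 5 has no matching school (illustrative)
    "detail1": ["d11", "d12", "d13", "d14"],
    "detail2": ["d21", "d22", "d23", "d24"],
})
schools = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "school_name": ["name1", "name2", "name3", "name4"],
})

# Series of school names indexed by id, used as a lookup table.
name_by_id = schools.set_index("id")["school_name"]

# Vectorized lookup; ids absent from the lookup table become NaN.
school_detail["type"] = school_detail["school_id"].map(name_by_id)
```

This relies on `id` being unique, as stated in the question; with duplicate IDs the lookup table would be ambiguous.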
+0

I did not use the .loc function. I merged the two tables on school_id to get school_name into school_detail. After I changed my code, it takes only a quarter of the time. –