2017-08-21 51 views
0

Fastest search using pandas in Python

I have two large DataFrames, df1 (100K rows) and df2 (600K rows). They look like this:

# df1
                                   name   price  brand           model
0   CANON CAMERA 20 FS36dINFS MEGAPIXEL  9900.0  CANON       FS36dINFS
1           SONY HD CAMERA 25 MEGAPIXEL  8900.0   SONY
2   LG 55" 4K UHD LED Smart TV 55UJ635V  5890.0     LG        55UJ635V
3  Sony 65" LED Smart TV KD-65XD8505BAE  4790.0   SONY  KD-65XD8505BAE
4   LG 49" 4K UHD LED Smart TV 49UJ651V  4390.0     LG        49UJ651V

# df2
                                     name   store   price
0     LG 49" 4K UHD LED Smart TV 49UJ651V  storeA  4790.0
1             SONY HD CAMERA 25 MEGAPIXEL  storeA   12.90
2  Samsung 32" LED Smart TV UE-32J4505XXE  storeB    1.30

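I want to check whether the brand and model strings from df1 both occur in the name column of df2, and do something with the rows where they do. Currently I iterate over both DataFrames like this: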

datalist = []
for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        # keep the pair when both the brand and the model occur in df2's name
        if row1['brand'] in row2['name'] and row1['model'] in row2['name']:
            datalist.append([row1['model'], row1['brand'], row1['name'], row1['price'],
                             row2['name'], row2['price'], row2['store']])

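This naive approach takes a very long time because both DataFrames are large. I have read that sets are faster, but since I am iterating over DataFrames with iterrows I cannot convert them to sets, because I would lose the row positions. Is there a faster way to do this?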

Answer

2

If there are many duplicates in df1['brand'] and df1['model'], you may be able to improve performance by building regex patterns for the brands and models:

brands = '({})'.format('|'.join(df1['brand'].dropna().unique())) 
# '(CANON|SONY|LG)' 
models = '({})'.format('|'.join(df1['model'].dropna().unique())) 
# '(FS36dINFS|55UJ635V|KD-65XD8505BAE|49UJ651V)' 

Then you can use the str.extract method to find the brand and model strings in df2['name']:

df2['brand'] = df2['name'].str.extract(brands, expand=False) 
df2['model'] = df2['name'].str.extract(models, expand=False) 
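As a small illustration of what this step produces, here is a minimal sketch that runs str.extract on the three sample names from df2, using the brand pattern shown above. With a single capture group and expand=False the result is a Series holding the first match per row, and rows without any match come back as NaN. The variable names `names`, `brand_pat` and `extracted` are only for this example.

```python
import pandas as pd

# The three sample names from df2 in the question
names = pd.Series([
    'LG 49" 4K UHD LED Smart TV 49UJ651V',
    'SONY HD CAMERA 25 MEGAPIXEL',
    'Samsung 32" LED Smart TV UE-32J4505XXE',
])

# One capture group containing every known brand as an alternative
brand_pat = '(CANON|SONY|LG)'

# expand=False with a single group returns a Series rather than a DataFrame
extracted = names.str.extract(brand_pat, expand=False)

# Rows with no known brand are NaN, so they can be removed with dropna()
print(extracted.tolist())
print(extracted.isna().tolist())
```

The NaN rows are why the merge step calls dropna(subset=...) on both frames before joining.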

Then you can obtain the desired data as a DataFrame by performing an inner merge:

bm = ['brand', 'model']
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')

import re 
import sys 
import pandas as pd 
pd.options.display.width = sys.maxsize 

df1 = pd.DataFrame({'brand': ['CANON', 'SONY', 'LG', 'SONY', 'LG'], 'model': ['FS36dINFS', None, '55UJ635V', 'KD-65XD8505BAE', '49UJ651V'], 'name': ['CANON CAMERA 20 FS36dINFS MEGAPIXEL', 'SONY HD CAMERA 25 MEGAPIXEL', 'LG 55" 4K UHD LED Smart TV 55UJ635V', 'Sony 65" LED Smart TV KD-65XD8505BAE', 'LG 49" 4K UHD LED Smart TV 49UJ651V'], 'price': [9900.0, 8900.0, 5890.0, 4790.0, 4390.0]}) 

df2 = pd.DataFrame({'name': ['LG 49" 4K UHD LED Smart TV 49UJ651V', 'SONY HD CAMERA 25 MEGAPIXEL', 'Samsung 32" LED Smart TV UE-32J4505XXE'], 'price': [4790.0, 12.9, 1.3], 'store': ['storeA', 'storeA', 'storeB']}) 

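bm = ['brand', 'model']
for col in bm:
    # escape each keyword so regex metacharacters are matched literally
    keywords = [re.escape(item) for item in df1[col].dropna().unique()]
    pat = '({})'.format('|'.join(keywords))
    # first keyword found in each name, or NaN when none is present
    df2[col] = df2['name'].str.extract(pat, expand=False)
# rows lacking a brand or model cannot match, so drop them before joining
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')
print(result)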

yields

  brand     model                               name_x  price_x                               name_y  price_y   store
0    LG  49UJ651V  LG 49" 4K UHD LED Smart TV 49UJ651V   4390.0  LG 49" 4K UHD LED Smart TV 49UJ651V   4790.0  storeA
+0

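When I run the code, it raises 'sre_constants.error: multiple repeat at position 515858' on the line 'df2[col] = df2['name'].str.extract(pat, expand=False)'... I thought there was a problem with the data in row 515858, so I removed that row, but it still throws the same error at the same position – muazfaiz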

+0

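There may be characters in 'brand' or 'model', such as a backslash, that have a special meaning to the regex engine, whereas we want them to be treated as string literals. The fix is to call 're.escape' on the strings we want interpreted as literals. I've updated the code to show what I mean. – unutbu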
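To make the escaping point concrete, here is a minimal sketch using a made-up model name, 'ZOOM**X' (hypothetical, not taken from the asker's data). Joined into a pattern unescaped, the two asterisks are read as stacked quantifiers and compilation fails with a "multiple repeat" error. The position reported in such an error is a character offset into the pattern string, not a DataFrame row number, which may be why deleting a row did not help.

```python
import re

# Hypothetical model names; 'ZOOM**X' contains regex metacharacters
models = ['ZOOM**X', 'KD-65XD8505BAE']

# Unescaped: '**' is parsed as two quantifiers in a row
raw_pat = '({})'.format('|'.join(models))
try:
    re.compile(raw_pat)
    raw_ok = True
except re.error as exc:
    raw_ok = False
    # exc.pos is an index into raw_pat, not a row index
    print('unescaped pattern failed:', exc)

# Escaped: every keyword is matched as a literal string
safe_pat = '({})'.format('|'.join(re.escape(m) for m in models))
match = re.search(safe_pat, 'ACME ZOOM**X CAMERA 12 MEGAPIXEL')
print(match.group(1))
```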

+0

Thanks a lot! I spent a lot of time trying to solve this – muazfaiz