
I have two dataframes, and I want to match strings between them and assign a word from the second dataframe to each sentence in the first.

(1st Dataframe) 
**Sentences** 
hello world 
live in the world 
haystack in the needle 

(2nd Dataframe in descending order by Weight) 
**Words** **Weight** 
world   80 
hello   60 
haystack  40 
needle   20 

For each sentence in the first dataframe, I want to check whether any word in the sentence appears in the second dataframe, and pick the matching word with the highest listed weight. I then assign that heaviest word to the sentence in the first dataframe. So the result should be:

**Sentence**    **Assigned Word** 
hello world     world 
live in the world    world 
haystack in the needle  haystack 

I thought about using two nested for loops, but the performance could be slow if there are millions of sentences or words. What is the best way to do this in Python? Thanks!
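For reference, here is a minimal sketch of the two-loop approach I would like to avoid, assuming the data is already loaded as df1 (with a Sentences column) and df2 (with Words/Weight, sorted by Weight descending); the variable names are illustrative:

# Naive nested-loop version: for each sentence, scan the weighted
# word list and keep the first (heaviest) word that appears.
# Assumes df1/df2 are already loaded as described above.
assigned = []
for sentence in df1['Sentences']:
    sentence_words = set(sentence.split(' '))
    best = None
    for word in df2['Words']:        # df2 is sorted by Weight descending
        if word in sentence_words:
            best = word              # first hit is the heaviest match
            break
    assigned.append(best)
df1['Assigned Word'] = assigned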

Answers


Cartesian product -> filter -> sort -> groupby.head(1)

This approach involves several steps, but it is the most pandas-like approach I can think of.

import pandas as pd

list1 = ['hello world',
         'live in the world',
         'haystack in the needle']

list2 = [['world', 80],
         ['hello', 60],
         ['haystack', 40],
         ['needle', 20]]

df1 = pd.DataFrame(list1, columns=['Sentences'])
df2 = pd.DataFrame(list2, columns=['Words', 'Weight'])

# Creating a new column `Word_List` holding each sentence's words
df1['Word_List'] = df1['Sentences'].apply(lambda x: x.split(' '))

# Need a common key for the cartesian product
df1['common_key'] = 1
df2['common_key'] = 1

# Cartesian product: every sentence paired with every word
df3 = pd.merge(df1, df2, on='common_key', copy=False)

# Filtering to only the rows where the word occurs in the sentence
df3['Match'] = df3.apply(lambda x: x['Words'] in x['Word_List'], axis=1)
df3 = df3[df3['Match']]

# Sorting so the heaviest word comes first within each sentence
df3.sort_values(['Sentences', 'Weight'], axis=0, inplace=True, ascending=False)

# Keeping only the first (heaviest) row in each group
final_df = df3.groupby('Sentences').head(1).reset_index()[['Sentences', 'Words']]
final_df

Output:

                Sentences     Words
0       live in the world     world
1             hello world     world
2  haystack in the needle  haystack

Performance: 10 loops, best of 3: 41.5 ms per loop
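A note on scale: the cartesian product grows as sentences × words, so with millions of rows it can become the bottleneck itself. Below is a minimal sketch of a dict-based variant that avoids the cross join, assuming the same df1 and df2 as above; heaviest_word is an illustrative helper, not part of the original answer.

# Alternative for very large inputs: O(1) dict lookups per word
# instead of materializing the full sentence-word cross join.
weight_map = dict(zip(df2['Words'], df2['Weight']))

def heaviest_word(sentence):
    # Keep only the words that appear in the weight table
    matched = [w for w in sentence.split(' ') if w in weight_map]
    return max(matched, key=weight_map.get) if matched else None

df1['Assigned Word'] = df1['Sentences'].apply(heaviest_word)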
