2016-11-25

I am fuzzy-matching between two DataFrames, one with shortcuts and one with campaigns. DataFrame A (df_cam) holds a client ID and an origin:

cli id |   origin 
------------------------------------ 
123 | 1234 M-MKT XYZklm 05/2016 

And DataFrame B (df_dict):

shortcut |   campaign 
------------------------------------ 
M-MKT | Mobile Marketing Outbound 

I know that the example client origin 1234 M-MKT XYZklm 05/2016 actually comes from the campaign Mobile Marketing Outbound, because it contains the keyword M-MKT.

Note that a shortcut is just a general keyword; the algorithm has to decide the rest. The origin could also be M-Marketing, MMKT, or Mob-MKT. I created the shortcut list manually by analyzing all the origins, and I already clean the origin with a regex before it enters the program.
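To make the matching idea concrete, here is a minimal stand-alone sketch. difflib's SequenceMatcher from the standard library stands in for fuzz.token_sort_ratio (the fuzzywuzzy scorer behaves similarly but additionally ignores token order), so the snippet runs without extra dependencies:

```python
# Minimal sketch of the shortcut matching, standard library only.
# SequenceMatcher stands in for fuzzywuzzy's token_sort_ratio here.
from difflib import SequenceMatcher

def score(shortcut, origin):
    """Similarity in percent between a shortcut and a cleaned origin."""
    return int(100 * SequenceMatcher(None, shortcut.lower(), origin.lower()).ratio())

shortcuts = ["M-MKT", "E-MAIL", "TELE"]   # hypothetical dictionary entries
origin = "M-MKT XYZklm"                   # origin already stripped of numbers

# The shortcut with the highest similarity wins.
best = max(shortcuts, key=lambda s: score(s, origin))
print(best)  # M-MKT
```

The real scorer matters for accuracy (token_sort_ratio is robust to word order), but the pick-the-max structure is the same.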

I want to match the client origin to a campaign via the shortcut and attach a similarity score, like this:

cli id | shortcut |   origin   |  campaign   | Score 
--------------------------------------------------------------------------------- 
123 | M-MKT | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93 

Below is my program. It works, but it is really slow: DataFrame A has ~400,000 rows and DataFrame B has ~40 rows.

Is there a way to make it faster?

from fuzzywuzzy import fuzz 
list_values = df_dict['Shortcut'].values.tolist() 

def TopFuzzMatch(tokenA, dict_, position, value): 
    """ 
    Calculates similarity between two tokens and returns TOP match and score 
    ----------------------------------------------------------------------- 
    tokenA: similarity to this token will be calculated 
    dict_: list with shortcuts 
    position: whether I want first, second, third...TOP position 
    value: 0=similarity score, 1=associated shortcut 
    ----------------------------------------------------------------------- 
    """ 
    sim = [(fuzz.token_sort_ratio(x, tokenA),x) for x in dict_] 
    sim.sort(key=lambda tup: tup[0], reverse=True) 
    return sim[position][value] 

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,1), axis=1) 
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,0), axis=1) 

Note that I also want to compute the 2nd and 3rd best matches to evaluate accuracy.

EDIT

I found the process.extractOne method, but the speed stays the same. So my code now looks like this:

from fuzzywuzzy import process 

def TopFuzzMatch(token, dict_, value): 
    # extractOne returns a (match, score) tuple 
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio) 
    return score[value] 
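For the 2nd and 3rd best matches, process.extract(token, dict_, scorer=fuzz.token_sort_ratio, limit=3) returns a ranked list instead of just the winner. A dependency-free sketch of the same top-k ranking, with difflib's SequenceMatcher standing in for the fuzzywuzzy scorer:

```python
# Top-3 ranking sketch, standard library only. With fuzzywuzzy this is
# process.extract(origin, shortcuts, scorer=fuzz.token_sort_ratio, limit=3).
from difflib import SequenceMatcher

def top_matches(origin, shortcuts, n=3):
    """Return the n best (score, shortcut) pairs, best first."""
    scored = [(int(100 * SequenceMatcher(None, s.lower(), origin.lower()).ratio()), s)
              for s in shortcuts]
    return sorted(scored, reverse=True)[:n]

# Hypothetical shortcut list, including a near-duplicate spelling.
top3 = top_matches("M-MKT XYZklm", ["M-MKT", "M-Marketing", "E-MAIL", "TELE"])
print(top3)
```

Comparing the score gap between the 1st and 2nd entries is a cheap way to flag ambiguous matches for manual review.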

Answer

I found a solution: after cleaning the origin column with a regex (no digits or special characters), only a few hundred distinct values remain, so I run the fuzzy matching on those only, which dramatically cuts the runtime.

import re 
import numpy as np 
import pandas as pd 
from fuzzywuzzy import process 

def TopFuzzMatch(df_cam, df_dict): 
    """ 
    Calculates similarity between two tokens and returns the TOP match 
    The idea is to do it only over distinct values in the given DF (takes ages otherwise) 
    ----------------------------------------------------------------------- 
    df_cam: DataFrame with client id and origin 
    df_dict: DataFrame with the abbreviation that is matched to the description I need 
    ----------------------------------------------------------------------- 
    """ 
    #Clean special characters and numbers 
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '', x['origin']), axis=1) 

    #Get unique values and calculate similarity 
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel()) 
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin] 

    #To DataFrame 
    df_match = pd.DataFrame({'unique': uq_origin}) 
    df_match['top_match'] = top_match 

    #Merge 
    df_cam = pd.merge(df_cam, df_match, how = 'left', left_on = 'clean_camp', right_on = 'unique') 
    df_cam = pd.merge(df_cam, df_dict, how = 'left', left_on = 'top_match', right_on = 'Shortcut') 

    return df_cam 

df_out = TopFuzzMatch(df_cam, df_dict)
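The same dedupe-and-cache idea can be sketched without pandas: score each distinct cleaned origin once and reuse the result for every row. Note that extractOne returns a (match, score) tuple, so the score the question asked for can be kept as well; difflib again stands in for the fuzzywuzzy scorer, and the data below is made up for illustration:

```python
# Dedupe-and-cache sketch: fuzzy-match each DISTINCT cleaned origin once,
# then look every row up in the cache. Standard library only.
import re
from difflib import SequenceMatcher

def best_match(origin, shortcuts):
    """(shortcut, score) pair with the highest similarity."""
    return max(((s, int(100 * SequenceMatcher(None, s.lower(), origin.lower()).ratio()))
                for s in shortcuts), key=lambda p: p[1])

origins = ["1234 M-MKT XYZklm 05/2016", "999 M-MKT XYZklm 01/2015", "55 TELE abc"]
shortcuts = ["M-MKT", "TELE"]

cache = {}
for raw in origins:
    clean = re.sub('[^A-Za-z]+', '', raw)   # same cleaning as in the answer
    if clean not in cache:                  # expensive match runs once per distinct value
        cache[clean] = best_match(clean, shortcuts)

# Rebuild per-row results from the cache: (origin, shortcut, score).
rows = [(raw, *cache[re.sub('[^A-Za-z]+', '', raw)]) for raw in origins]
```

Here the first two origins clean to the same string, so only two fuzzy comparisons are done for three rows; with ~400,000 rows collapsing to a few hundred distinct values, that is where the speedup comes from.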