2016-11-25

I am fuzzy-matching between two DataFrames, one with shortcuts and one with campaigns. DataFrame A (df_cam) holds a client ID and an origin:

cli id |   origin 
------------------------------------ 
123 | 1234 M-MKT XYZklm 05/2016 

And DataFrame B (df_dict):

shortcut |   campaign 
------------------------------------ 
M-MKT | Mobile Marketing Outbound 

I know that the example client origin 1234 M-MKT XYZklm 05/2016 actually comes from the campaign Mobile Marketing Outbound, because it contains the keyword M-MKT.

Note that a shortcut is just a general keyword; the algorithm has to decide the rest. The origin could also be M-Marketing, MMKT, or Mob-MKT. I created the shortcut list manually by analyzing all the origins, and I already clean the origin with a regex before it enters the program.
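To make the matching idea concrete, here is a minimal stand-alone sketch. difflib's SequenceMatcher from the standard library stands in for fuzz.token_sort_ratio (the fuzzywuzzy scorer behaves similarly but additionally ignores token order), so the snippet runs without extra dependencies:

```python
# Minimal sketch of the shortcut matching, standard library only.
# SequenceMatcher stands in for fuzzywuzzy's token_sort_ratio here.
from difflib import SequenceMatcher

def score(shortcut, origin):
    """Similarity in percent between a shortcut and a cleaned origin."""
    return int(100 * SequenceMatcher(None, shortcut.lower(), origin.lower()).ratio())

shortcuts = ["M-MKT", "E-MAIL", "TELE"]   # hypothetical dictionary entries
origin = "M-MKT XYZklm"                   # origin already stripped of numbers

# The shortcut with the highest similarity wins.
best = max(shortcuts, key=lambda s: score(s, origin))
print(best)  # M-MKT
```

The real scorer matters for accuracy (token_sort_ratio is robust to word order), but the pick-the-max structure is the same.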

I want to match the client origin to a campaign via the shortcut and attach a similarity score, like this:

cli id | shortcut |   origin   |  campaign   | Score 
--------------------------------------------------------------------------------- 
123 | M-MKT | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93 

Below is my program. It works, but it is really slow: DataFrame A has ~400,000 rows and DataFrame B has ~40 rows.

Is there a way to make it faster?

from fuzzywuzzy import fuzz 
list_values = df_dict['Shortcut'].values.tolist() 

def TopFuzzMatch(tokenA, dict_, position, value): 
    """ 
    Calculates similarity between two tokens and returns TOP match and score 
    ----------------------------------------------------------------------- 
    tokenA: similarity to this token will be calculated 
    dict_: list with shortcuts 
    position: whether I want first, second, third...TOP position 
    value: 0=similarity score, 1=associated shortcut 
    ----------------------------------------------------------------------- 
    """ 
    sim = [(fuzz.token_sort_ratio(x, tokenA),x) for x in dict_] 
    sim.sort(key=lambda tup: tup[0], reverse=True) 
    return sim[position][value] 

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,1), axis=1) 
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,0), axis=1) 

Note that I also want to compute the 2nd and 3rd best matches to evaluate accuracy.

EDIT

I found the process.extractOne method, but the speed stays the same. So my code now looks like this:

from fuzzywuzzy import process 

def TopFuzzMatch(token, dict_, value): 
    # extractOne returns a (match, score) tuple 
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio) 
    return score[value] 
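For the 2nd and 3rd best matches, process.extract(token, dict_, scorer=fuzz.token_sort_ratio, limit=3) returns a ranked list instead of just the winner. A dependency-free sketch of the same top-k ranking, with difflib's SequenceMatcher standing in for the fuzzywuzzy scorer:

```python
# Top-3 ranking sketch, standard library only. With fuzzywuzzy this is
# process.extract(origin, shortcuts, scorer=fuzz.token_sort_ratio, limit=3).
from difflib import SequenceMatcher

def top_matches(origin, shortcuts, n=3):
    """Return the n best (score, shortcut) pairs, best first."""
    scored = [(int(100 * SequenceMatcher(None, s.lower(), origin.lower()).ratio()), s)
              for s in shortcuts]
    return sorted(scored, reverse=True)[:n]

# Hypothetical shortcut list, including a near-duplicate spelling.
top3 = top_matches("M-MKT XYZklm", ["M-MKT", "M-Marketing", "E-MAIL", "TELE"])
print(top3)
```

Comparing the score gap between the 1st and 2nd entries is a cheap way to flag ambiguous matches for manual review.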

Answer

I found a solution: after cleaning the origin column with a regex (no digits or special characters), only a few hundred distinct values remain, so I run the fuzzy matching on those only, which dramatically cuts the runtime.

import re 
import numpy as np 
import pandas as pd 
from fuzzywuzzy import process 

def TopFuzzMatch(df_cam, df_dict): 
    """ 
    Calculates similarity between two tokens and returns the TOP match 
    The idea is to do it only over distinct values in the given DF (takes ages otherwise) 
    ----------------------------------------------------------------------- 
    df_cam: DataFrame with client id and origin 
    df_dict: DataFrame with the abbreviation that is matched to the description I need 
    ----------------------------------------------------------------------- 
    """ 
    #Clean special characters and numbers 
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '', x['origin']), axis=1) 

    #Get unique values and calculate similarity 
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel()) 
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin] 

    #To DataFrame 
    df_match = pd.DataFrame({'unique': uq_origin}) 
    df_match['top_match'] = top_match 

    #Merge 
    df_cam = pd.merge(df_cam, df_match, how = 'left', left_on = 'clean_camp', right_on = 'unique') 
    df_cam = pd.merge(df_cam, df_dict, how = 'left', left_on = 'top_match', right_on = 'Shortcut') 

    return df_cam 

df_out = TopFuzzMatch(df_cam, df_dict)
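The same dedupe-and-cache idea can be sketched without pandas: score each distinct cleaned origin once and reuse the result for every row. Note that extractOne returns a (match, score) tuple, so the score the question asked for can be kept as well; difflib again stands in for the fuzzywuzzy scorer, and the data below is made up for illustration:

```python
# Dedupe-and-cache sketch: fuzzy-match each DISTINCT cleaned origin once,
# then look every row up in the cache. Standard library only.
import re
from difflib import SequenceMatcher

def best_match(origin, shortcuts):
    """(shortcut, score) pair with the highest similarity."""
    return max(((s, int(100 * SequenceMatcher(None, s.lower(), origin.lower()).ratio()))
                for s in shortcuts), key=lambda p: p[1])

origins = ["1234 M-MKT XYZklm 05/2016", "999 M-MKT XYZklm 01/2015", "55 TELE abc"]
shortcuts = ["M-MKT", "TELE"]

cache = {}
for raw in origins:
    clean = re.sub('[^A-Za-z]+', '', raw)   # same cleaning as in the answer
    if clean not in cache:                  # expensive match runs once per distinct value
        cache[clean] = best_match(clean, shortcuts)

# Rebuild per-row results from the cache: (origin, shortcut, score).
rows = [(raw, *cache[re.sub('[^A-Za-z]+', '', raw)]) for raw in origins]
```

Here the first two origins clean to the same string, so only two fuzzy comparisons are done for three rows; with ~400,000 rows collapsing to a few hundred distinct values, that is where the speedup comes from.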