2017-09-18 60 views
0

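MySQL data matching: better alternatives?

I have customers and leads coming from different sources, and I need to figure out whether a customer is already registered as a lead.

I'm matching on 12 fields: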

address1_clear 
address2_clear 
address_clear 
contact_name_clear 
email 
invoice_mobile 
invoice_phone 
mobile 
name_clear 
phone 
phone2 
taxnum 

(The _clear suffix means the data is lowercase, without spaces and punctuation.)
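
For illustration, the normalization behind the _clear fields could look roughly like the sketch below. The function name `clear` is hypothetical and the exact rules used to populate these columns are not shown in the question; this version assumes "punctuation" means any character in a Unicode punctuation category.

```python
import unicodedata


def clear(value):
    """Return value lowercased, with whitespace and punctuation removed.

    Hypothetical helper: the real rules behind the *_clear columns are
    not given, so this is only one plausible reading of them.
    """
    if value is None:
        return None
    kept = []
    for ch in value.lower():
        if ch.isspace():
            continue  # drop spaces, tabs, newlines
        if unicodedata.category(ch).startswith("P"):
            continue  # drop punctuation such as . , - ' #
        kept.append(ch)
    return "".join(kept)
```

Applying the same function to both sides before comparing is what makes plain equality checks on these columns meaningful.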

  • leads - 300K records
  • customers - 500K records
  • customers_leads - 460K records

This is the query used for matching:

SELECT l.id as lead_id, c.id as customer_id FROM lead l 
INNER JOIN sync_settings s ON s.account_id = l.account_id 
INNER JOIN customers c ON c.setting_id = s.id 
LEFT JOIN customers_leads cl ON cl.customer_id = c.id AND cl.lead_id = l.id 
WHERE cl.lead_id IS NULL AND 
(
    (l.phone IS NOT NULL AND l.phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR 
    (l.mobile IS NOT NULL AND l.mobile != "" AND l.mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR 
    (l.invoice_phone IS NOT NULL AND l.invoice_phone != "" AND l.invoice_phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR 
    (l.invoice_mobile IS NOT NULL AND l.invoice_mobile != "" AND l.invoice_mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR 
    (l.email IS NOT NULL AND l.email != "" AND l.email = c.email) OR 
    (l.taxnum IS NOT NULL AND l.taxnum != "" AND l.taxnum = c.taxnum) OR 
    (l.contact_name_clear IS NOT NULL AND l.contact_name_clear != "" AND l.contact_name_clear = c.contact_name_clear) OR 
    (l.address1_clear IS NOT NULL AND l.address1_clear != "" AND l.address1_clear = c.address_clear) OR 
    (l.address2_clear IS NOT NULL AND l.address2_clear != "" AND l.address2_clear = c.address_clear) OR 
    (l.name_clear IS NOT NULL AND l.name_clear != "" AND l.name_clear IN (c.contact_name_clear, c.name_clear)) 
) 

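It is extremely heavy: the response time is ~4 minutes. Because of the ORs and the additional conditions, indexes don't help much.

I'm wondering: is there a better way to do this? Maybe using some NoSQL database to basically build one huge hash table, or some data matching technique that I haven't been able to find on Google?

P.S. I know I could make a separate table purely for the matching fields, and it would be faster, but I'd still like to know my alternatives.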
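
To make the "huge hash table" idea concrete, here is a minimal in-memory sketch, assuming leads and customers are loaded as plain dicts keyed by column name. The function names `build_index` and `match_lead` are made up for illustration, only the phone, email and taxnum rules from the query are covered, and the account scoping done through sync_settings is left out.

```python
from collections import defaultdict

# Field groupings taken from the query; everything else here is illustrative.
LEAD_PHONE_FIELDS = ("phone", "mobile", "invoice_phone", "invoice_mobile")
CUSTOMER_PHONE_FIELDS = ("phone", "phone2", "invoice_phone", "invoice_mobile")
EXACT_FIELDS = ("email", "taxnum")


def build_index(customers):
    """Map (kind, value) -> set of customer ids, skipping NULL/empty values."""
    index = defaultdict(set)
    for customer in customers:
        for field in CUSTOMER_PHONE_FIELDS:
            value = customer.get(field)
            if value:
                # All customer phone columns share one "phone" bucket.
                index[("phone", value)].add(customer["id"])
        for field in EXACT_FIELDS:
            value = customer.get(field)
            if value:
                index[(field, value)].add(customer["id"])
    return index


def match_lead(lead, index):
    """Return ids of customers sharing any phone, email or taxnum with lead."""
    found = set()
    for field in LEAD_PHONE_FIELDS:
        value = lead.get(field)
        if value:
            found |= index.get(("phone", value), set())
    for field in EXACT_FIELDS:
        value = lead.get(field)
        if value:
            found |= index.get((field, value), set())
    return found
```

Each lead then costs a handful of dictionary lookups instead of a comparison against every customer row, which is the effect the OR conditions prevent the SQL indexes from delivering.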
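
A rough sketch of that separate-table idea, using Python's built-in sqlite3 as a stand-in for MySQL. The tables `lead_keys` and `customer_keys`, their columns, and the function `find_matches` are hypothetical: each row holds one non-empty matching value together with its kind, so the OR chain becomes a single indexed equality join.

```python
import sqlite3


def find_matches(lead_keys, customer_keys):
    """Return sorted, distinct (lead_id, customer_id) pairs sharing a key.

    Both arguments are iterables of (id, kind, value) tuples. The schema
    is an assumption made for this sketch, not the asker's actual tables.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lead_keys (lead_id INTEGER, kind TEXT, value TEXT);
        CREATE TABLE customer_keys (customer_id INTEGER, kind TEXT, value TEXT);
        CREATE INDEX idx_customer_keys ON customer_keys (kind, value);
        """
    )
    conn.executemany("INSERT INTO lead_keys VALUES (?, ?, ?)", lead_keys)
    conn.executemany("INSERT INTO customer_keys VALUES (?, ?, ?)", customer_keys)
    # One equality join on (kind, value) replaces the chain of OR conditions.
    rows = conn.execute(
        """
        SELECT DISTINCT lk.lead_id, ck.customer_id
        FROM lead_keys lk
        JOIN customer_keys ck
          ON ck.kind = lk.kind AND ck.value = lk.value
        ORDER BY lk.lead_id, ck.customer_id
        """
    ).fetchall()
    conn.close()
    return rows
```

In this layout every phone column would be stored under the same kind (for example "phone"), so any lead phone can match any customer phone, as in the original query.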

Answers

0

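The problem you are facing is called record linkage, and no database solves it natively.

There are a number of open source projects you could use, including Duke and dedupe (I am a main author of dedupe).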

1

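Another open source project to consider is recordlinkage (Python Record Linkage Toolkit). The project's documentation includes an overview of the record linkage process, code examples for beginners, and API documentation.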