2014-06-24 118 views
0

Name split and compare

I have a situation where I need to do the following:

Company name:

a. Split the text before and after ' - '
b. Generate a report where the texts before and after ' - ' are an exact match
c. Generate a report where the texts before and after ' - ' are similar matches

I have got as far as point b: I can get the names whose first half and second half are identical (such as ABC, INC. - ABC, INC.) using the following:

RTRIM(substring(c.companyname,0,charindex('-',c.companyname)))= LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname))) 

However, I can't produce the next report (for example ABC - abc, or abc, inc - abc).

Can someone help?

+1

Please edit your question and provide sample data and the desired results. –

+0

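Names:
Walt Disney - Walt Disney
Fun Food - Fun Food
Fun Food, Inc. - Fun Food
Walt Disney - Walt Disney, Inc.
I already have a query that lets me pull rows 1 and 2 only: RTRIM(substring(c.companyname,0,charindex('-',c.companyname)))= LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname))). I want a query that also returns rows 3 and 4, because in those rows the first and second names are the same too, just slightly different (for example a period (.) or an additional word (Inc.) is present). – user3769697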

Answer

0

Try this?

DECLARE @CompanyNames TABLE (
    CompanyName VARCHAR(512)); 
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney'); 
INSERT INTO @CompanyNames VALUES ('Fun Food - Fun Food'); 
INSERT INTO @CompanyNames VALUES ('Fun Food, Inc. - Fun Food'); 
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney, Inc.'); 

--Split names 
DECLARE @SplitNames TABLE (
    MatchLeft VARCHAR(128), 
    MatchRight VARCHAR(128)); 
INSERT INTO 
    @SplitNames 
SELECT 
    RTRIM(SUBSTRING(CompanyName, 0, CHARINDEX('-', CompanyName))), 
    LTRIM(SUBSTRING(CompanyName, CHARINDEX('-', CompanyName, 0) + 1, LEN(CompanyName))) 
FROM 
    @CompanyNames; 

--Exact matches 
SELECT 
    MatchLeft, 
    MatchRight, 
    CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Exact 
FROM 
    @SplitNames; 

--Inexact matches 
WITH CleansedCompanyNames AS (
    SELECT 
     MatchLeft AS OriginalMatchLeft, 
     MatchRight AS OriginalMatchRight, 
     REPLACE(REPLACE(REPLACE(MatchLeft, '.', ''), 'Inc', ''), ',', '') AS MatchLeft, 
     REPLACE(REPLACE(REPLACE(MatchRight, '.', ''), 'Inc', ''), ',', '') AS MatchRight 
    FROM 
     @SplitNames) 
SELECT 
    OriginalMatchLeft, 
    OriginalMatchRight, 
    MatchLeft, 
    MatchRight, 
    CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Inexact
FROM 
    CleansedCompanyNames; 

--Using SOUNDEX 
SELECT 
    MatchLeft, 
    MatchRight, 
    CASE WHEN DIFFERENCE(MatchLeft, MatchRight) >= 3 THEN 1 ELSE 0 END AS Score 
FROM 
    @SplitNames; 

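There are two approaches to handling inexact matches:

  • Either strip punctuation and unwanted words before matching (but this requires building a list of what to strip); or
  • Use SOUNDEX to test how similar the strings are.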
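
For the first approach, the list of things to strip could live in its own table, so it can grow without rewriting the query. The following is only a rough sketch: @NoiseWords, @Cleansed and their contents are made up for illustration, and the row-by-row loop is not tuned for large tables.

```sql
-- Sample halves, in the same shape as @SplitNames above.
DECLARE @SplitNames TABLE (
    MatchLeft VARCHAR(128),
    MatchRight VARCHAR(128));
INSERT INTO @SplitNames VALUES ('Fun Food, Inc.', 'Fun Food');
INSERT INTO @SplitNames VALUES ('Walt Disney', 'Walt Disney Corp.');

-- Hypothetical noise list; Id fixes the order in which entries are removed.
DECLARE @NoiseWords TABLE (
    Id INT IDENTITY(1, 1),
    Word VARCHAR(32));
INSERT INTO @NoiseWords (Word) VALUES ('.');
INSERT INTO @NoiseWords (Word) VALUES (',');
INSERT INTO @NoiseWords (Word) VALUES ('Inc');
INSERT INTO @NoiseWords (Word) VALUES ('Corp');

-- Working copy that keeps the originals next to the cleansed values.
DECLARE @Cleansed TABLE (
    OriginalLeft VARCHAR(128),
    OriginalRight VARCHAR(128),
    CleanLeft VARCHAR(128),
    CleanRight VARCHAR(128));
INSERT INTO @Cleansed
SELECT MatchLeft, MatchRight, MatchLeft, MatchRight
FROM @SplitNames;

-- Strip each noise entry in turn from both halves.
DECLARE @Id INT, @MaxId INT, @Word VARCHAR(32);
SET @Id = 1;
SELECT @MaxId = MAX(Id) FROM @NoiseWords;
WHILE @Id <= @MaxId
BEGIN
    SELECT @Word = Word FROM @NoiseWords WHERE Id = @Id;
    UPDATE @Cleansed
    SET CleanLeft = REPLACE(CleanLeft, @Word, ''),
        CleanRight = REPLACE(CleanRight, @Word, '');
    SET @Id = @Id + 1;
END;

-- Compare what is left after trimming stray spaces.
SELECT
    OriginalLeft,
    OriginalRight,
    CASE WHEN LTRIM(RTRIM(CleanLeft)) = LTRIM(RTRIM(CleanRight)) THEN 1 ELSE 0 END AS Inexact
FROM
    @Cleansed;
```

Two caveats with this sketch: REPLACE works on substrings, so 'Inc' would also be cut out of a name such as 'Lincoln', and a longer entry has to be listed before any shorter entry it contains (for example 'Corporation' before 'Corp').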

Alternatively, to use your original example, you could use SOUNDEX like this:

SELECT ... 
WHERE 
DIFFERENCE(RTRIM(substring(c.companyname,0,charindex('-',c.companyname))), LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname)))) >= 3 
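
If the names already sit in a table column rather than in a table variable, the same expression can be pointed at that column. A minimal sketch, assuming a temporary table #Company as a stand-in for the real table (the table name is a placeholder; companyname is the column from the question):

```sql
-- #Company stands in for the real table that holds the names.
CREATE TABLE #Company (
    companyname VARCHAR(500));
INSERT INTO #Company VALUES ('Fun Food, Inc. - Fun Food');
INSERT INTO #Company VALUES ('No separator here');

SELECT
    c.companyname,
    s.MatchLeft,
    s.MatchRight,
    DIFFERENCE(s.MatchLeft, s.MatchRight) AS Similarity
FROM
    #Company c
    -- Split once per row so both halves can be reused by name.
    CROSS APPLY (
        SELECT
            RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname))) AS MatchLeft,
            LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname, 0) + 1, LEN(c.companyname))) AS MatchRight
    ) s
WHERE
    CHARINDEX('-', c.companyname) > 0   -- skip names that have no separator
    AND DIFFERENCE(s.MatchLeft, s.MatchRight) >= 3;

DROP TABLE #Company;
```

Replacing #Company with the real table name (and companyname with the real column name) should be the only change needed. The CROSS APPLY is just a convenience to avoid repeating the split expressions.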

And with your latest example:

DECLARE @Company TABLE (
    companyname VARCHAR(500)); 
INSERT INTO @Company VALUES ('Allen Limited - Allen Corporation'); 
INSERT INTO @Company VALUES ('Sweden Corp. - Sweden Corp.'); 
INSERT INTO @Company VALUES ('Alaska Limited - Alaska Limited, Inc.'); 
INSERT INTO @Company VALUES ('New York Inc. - New York Steel Limited'); 
INSERT INTO @Company VALUES ('India Plc - India Plc.'); 
INSERT INTO @Company VALUES ('Dubai International - Dubai International'); 
INSERT INTO @Company VALUES ('Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls'); 
SELECT 
    c.companyname, 
    DIFFERENCE(RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname))), LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname, 0) + 1, LEN(c.companyname)))) AS Similarity 
FROM 
    @Company c; 

With these results:

companyname                                     Similarity
Allen Limited - Allen Corporation               4
Sweden Corp. - Sweden Corp.                     4
Alaska Limited - Alaska Limited, Inc.           4
New York Inc. - New York Steel Limited          4
India Plc - India Plc.                          4
Dubai International - Dubai International       4
Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls  1

So it doesn't work so well for your last example, but it seems to be fine for the others?
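
DIFFERENCE compares the SOUNDEX codes of its two arguments, so printing those codes next to each name can help explain a low score. A small sketch using two of the sample rows; the LeftCode and RightCode aliases are made up for illustration:

```sql
DECLARE @Company TABLE (
    companyname VARCHAR(500));
INSERT INTO @Company VALUES ('New York Inc. - New York Steel Limited');
INSERT INTO @Company VALUES ('Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls');

-- Show the four-character SOUNDEX code of each half.
SELECT
    c.companyname,
    SOUNDEX(RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname)))) AS LeftCode,
    SOUNDEX(LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname, 0) + 1, LEN(c.companyname)))) AS RightCode
FROM
    @Company c;
```

In the results above, every pair whose halves start with the same word scores 4, even when the rest differs, while the one pair that starts with different words scores 1. That suggests the comparison is dominated by the beginning of each string, although exactly how SOUNDEX treats spaces and punctuation is an assumption that would need checking against the documentation.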

+0

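It works, but the names I quoted in the query above were just a few examples. I have a lot of records to check, and they may differ in any way, not only by the word 'Inc'. It could be anything, such as small variations of 'Magic Land' or 'Fun Food'. – user3769697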

+0

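Which is why I provided two examples. SOUNDEX DIFFERENCE evaluates how similar two strings are, so it should be worth a try? Personally, I prefer a more robust distance algorithm (Jaro-Winkler, for example), but that is a bit harder to implement because it isn't built into the engine. –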

+0

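Yes, this works with the example names I gave, but how do I use it if the names have to be selected from a column called "Name"? Right now we are using INSERT INTO @CompanyNames VALUES ('Fun Food - Fun Food'); and I don't understand how to use it for a column. – user3769697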