How about trying this?
DECLARE @CompanyNames TABLE (
CompanyName VARCHAR(512));
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney');
INSERT INTO @CompanyNames VALUES ('Fun Food - Fun Food');
INSERT INTO @CompanyNames VALUES ('Fun Food, Inc. - Fun Food');
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney, Inc.');
--Split names
DECLARE @SplitNames TABLE (
MatchLeft VARCHAR(128),
MatchRight VARCHAR(128));
INSERT INTO
@SplitNames
SELECT
RTRIM(SUBSTRING(CompanyName, 0, CHARINDEX('-', CompanyName))),
LTRIM(SUBSTRING(CompanyName, CHARINDEX('-', CompanyName, 0) + 1, LEN(CompanyName)))
FROM
@CompanyNames;
--Exact matches
SELECT
MatchLeft,
MatchRight,
CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Exact
FROM
@SplitNames;
--Inexact matches
WITH CleansedCompanyNames AS (
SELECT
MatchLeft AS OriginalMatchLeft,
MatchRight AS OriginalMatchRight,
REPLACE(REPLACE(REPLACE(MatchLeft, '.', ''), 'Inc', ''), ',', '') AS MatchLeft,
REPLACE(REPLACE(REPLACE(MatchRight, '.', ''), 'Inc', ''), ',', '') AS MatchRight
FROM
@SplitNames)
SELECT
OriginalMatchLeft,
OriginalMatchRight,
MatchLeft,
MatchRight,
CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Inexact
FROM
CleansedCompanyNames;
--Using SOUNDEX (via DIFFERENCE, which compares the SOUNDEX codes of both sides)
SELECT
MatchLeft,
MatchRight,
CASE WHEN DIFFERENCE(MatchLeft, MatchRight) >= 3 THEN 1 ELSE 0 END AS Score
FROM
@SplitNames;
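There are two ways to handle the inexact matches:
- either strip punctuation and unwanted words before matching (but this means building a list of what to replace); or
- use SOUNDEX to test how similar the two strings are.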
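For the first approach, the replace list can live in a table instead of a chain of nested REPLACE calls. The following is only a rough sketch: the table variables `@Names` and `@Noise`, their columns, and the tokens in the list are all made up for illustration, and the multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample names; Cleansed starts out as a copy of Name.
DECLARE @Names TABLE (
    Name VARCHAR(128),
    Cleansed VARCHAR(128));
INSERT INTO @Names (Name, Cleansed) VALUES
    ('Fun Food, Inc.', 'Fun Food, Inc.'),
    ('Walt Disney', 'Walt Disney');

-- Hypothetical list of tokens to strip before comparing.
DECLARE @Noise TABLE (
    Id INT IDENTITY(1, 1) PRIMARY KEY,
    Token VARCHAR(32));
INSERT INTO @Noise (Token) VALUES ('.'), (','), ('Inc'), ('Ltd');

DECLARE @i INT = 1, @max INT, @token VARCHAR(32);
SELECT @max = MAX(Id) FROM @Noise;

-- Apply one token per pass to every row.
WHILE @i <= @max
BEGIN
    SELECT @token = Token FROM @Noise WHERE Id = @i;
    UPDATE @Names SET Cleansed = REPLACE(Cleansed, @token, '');
    SET @i += 1;
END;

SELECT
    Name,
    LTRIM(RTRIM(Cleansed)) AS Cleansed
FROM
    @Names;
```

Bear in mind that REPLACE matches substrings using the collation of the input, so under a case-insensitive collation a token such as 'Inc' would also be removed from the middle of a word like 'Lincoln'. The list needs some care.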
Or, to stick with your original example, you could use SOUNDEX like this:
SELECT ...
WHERE
DIFFERENCE(RTRIM(substring(c.companyname,0,charindex('-',c.companyname))), LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname)))) >= 3
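If repeating the SUBSTRING expressions gets unwieldy, one option is to compute both halves once with CROSS APPLY and then refer to them by name. This is just a sketch: the table variable `@CompanySample`, its rows, and the aliases `s`, `MatchLeft` and `MatchRight` are placeholders of mine, not something from your query.

```sql
-- Hypothetical sample data standing in for the real table.
DECLARE @CompanySample TABLE (
    companyname VARCHAR(500));
INSERT INTO @CompanySample VALUES ('Fun Food, Inc. - Fun Food');
INSERT INTO @CompanySample VALUES ('Walt Disney - Walt Disney, Inc.');

SELECT
    c.companyname,
    s.MatchLeft,
    s.MatchRight
FROM
    @CompanySample c
    CROSS APPLY (
        -- Same split expressions as above, evaluated once per row.
        SELECT
            RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname))) AS MatchLeft,
            LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname) + 1, LEN(c.companyname))) AS MatchRight
    ) s
WHERE
    DIFFERENCE(s.MatchLeft, s.MatchRight) >= 3;
```

The split logic is unchanged. It only moves to one place, so the threshold in the WHERE clause is easier to read and adjust.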
And with your latest example:
DECLARE @Company TABLE (
companyname VARCHAR(500));
INSERT INTO @Company VALUES ('Allen Limited - Allen Corporation');
INSERT INTO @Company VALUES ('Sweden Corp. - Sweden Corp.');
INSERT INTO @Company VALUES ('Alaska Limited - Alaska Limited, Inc.');
INSERT INTO @Company VALUES ('New York Inc. - New York Steel Limited');
INSERT INTO @Company VALUES ('India Plc - India Plc.');
INSERT INTO @Company VALUES ('Dubai International - Dubai International');
INSERT INTO @Company VALUES ('Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls');
SELECT
c.companyname,
DIFFERENCE(RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname))), LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname, 0) + 1, LEN(c.companyname)))) AS Similarity
FROM
@Company c;
Which gives these results:
companyname                                     Similarity
Allen Limited - Allen Corporation               4
Sweden Corp. - Sweden Corp.                     4
Alaska Limited - Alaska Limited, Inc.           4
New York Inc. - New York Steel Limited          4
India Plc - India Plc.                          4
Dubai International - Dubai International       4
Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls  1
So it doesn't work too well for your last example, but it seems fine for the others?
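To see why a given pair scores high or low, it may help to look at the SOUNDEX codes next to the DIFFERENCE value. DIFFERENCE compares the two four-character SOUNDEX codes and returns a value from 0 to 4. A quick diagnostic sketch using two of the pairs from above (the derived-table alias `v` and the column names `LeftCode` and `RightCode` are mine, and the VALUES constructor assumes SQL Server 2008 or later):

```sql
SELECT
    v.MatchLeft,
    v.MatchRight,
    SOUNDEX(v.MatchLeft) AS LeftCode,    -- four-character phonetic code
    SOUNDEX(v.MatchRight) AS RightCode,
    DIFFERENCE(v.MatchLeft, v.MatchRight) AS Similarity  -- 0 (weak) to 4 (strong)
FROM
    (VALUES
        ('Allen Limited', 'Allen Corporation'),
        ('Nigera Falls Pvt. Ltd.', 'Amazing Nigeria Falls')
    ) AS v (MatchLeft, MatchRight);
```

Judging by the results above, the start of each string seems to dominate the score: 'Allen Limited' and 'Allen Corporation' get a 4 even though the second words differ, while the pair that starts with different words gets a 1. So a high score is probably best treated as a hint rather than proof of a match.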
Please edit your question and provide sample data and the desired results.
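Names: Walt Disney - Walt Disney; Fun Food - Fun Food; Fun Food, Inc. - Fun Food; Walt Disney - Walt Disney, Inc. I already have a query that lets me pull rows 1 and 2 only: RTRIM(substring(c.companyname,0,charindex('-',c.companyname))) = LTRIM(substring(c.companyname,charindex('-',c.companyname,0)+1,len(c.companyname))). I want a query that also gets me row 3 and row 4, because in those rows the two names are the same as well, just slightly different (for example, a dot (.) or an extra word (Inc.) is present). – user3769697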