2016-06-08

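How to combine similar strings for counting in SQL Server

I built a query that finds the longest common substrings in a column and orders them by frequency. The problem I'm running into is removing or grouping similar results.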

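Here is the TOP 5 output from the code below. Note that "I love Mittens the cat" is the longest, most frequent string, but the code also finds every fragment of that string, such as "I love Mittens the ca" or "I love Mittens the c".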

I love Mittens the cat    3
I love Mittens the ca     3
love Mittens the cat      3
love Mittens the ca       3
I love Mittens the c      3

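If possible, I'd like to remove any substring that is just a near-duplicate of another one. Row 3 is fine because it consists entirely of complete words, but rows 4 and 5 should be removed because they are similar to row 1.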

DECLARE  @MinLength INT   = 5  --Minimum Substring Length 
DECLARE  @MaxLength INT   = 50 --Maximum Substring Length 
DECLARE  @Delimeter VARCHAR(5) = ' ' 
DECLARE  @T TABLE 
      (
        ID INT IDENTITY 
       , chvStrings VARCHAR(64) 
      ) 
INSERT INTO @T VALUES 
      ('I like cats'), 
      ('I like dogs'), 
      ('cats are great'), 
      ('look at that cat'), 
      ('I love Mittens the cat'), 
      ('I love Mittens the cat a lot'), 
      ('I love Mittens the cat so much'), 
      ('Dogs are okay, I guess...') 

SELECT TOP 10000 
    SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word, 
    COUNT(M.number) AS [Count] 
FROM   
    @T as T 
CROSS APPLY 
    (SELECT N.Number 
    FROM [master]..spt_values as N 
    WHERE N.type = 'P' 
     AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N 
CROSS APPLY 
    (SELECT N.Number 
    FROM [master]..spt_values as N 
    WHERE N.type = 'P' 
     AND N.number BETWEEN @MinLength AND @MaxLength) M 
WHERE  
    N.number <= LEN(t.chvStrings) - M.number + 1 
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% ' 
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%' 
    AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR N.number = 1) 
GROUP BY 
    SUBSTRING(T.chvStrings, N.Number, M.Number)      
ORDER BY  
    COUNT(T.chvStrings) DESC, 
    LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC 

Unless you have a table of all words, how would you know which combinations of letters form a complete word?

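Great job posting sample data and making it easy for people to help. I only wish I knew what you are doing here. It's not clear to me what you are really trying to accomplish.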

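I could build a table of all words with a delimiter-split function, but that seems like a lot of work. To elaborate, let's say I work for Staples and we want to see what people write on the business cards they print. We have a database field that stores all of a customer's text as a single string. I'd love to know what the most popular business card themes are. This query is an attempt to find that information. The problem is that you get a lot of very similar strings (as in this example), and it's too much. – Fubudis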
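
The word table mentioned in this comment may be less work than it sounds. The following is only a minimal sketch, assuming SQL Server 2016 or later (compatibility level 130), where the built-in `STRING_SPLIT` function is available. It counts single words rather than phrases, so it is a complement to the substring query, not a replacement for it. The three sample rows are copied from the question; the alias `W` is just illustrative.

```sql
-- Assumption: SQL Server 2016+ so that STRING_SPLIT exists.
DECLARE @T TABLE (ID INT IDENTITY, chvStrings VARCHAR(64));

INSERT INTO @T VALUES
    ('I like cats'),
    ('I love Mittens the cat'),
    ('I love Mittens the cat a lot');

-- Split every row on a single space and count how often each word occurs.
SELECT
    W.value  AS Word,
    COUNT(*) AS [Count]
FROM @T AS T
CROSS APPLY STRING_SPLIT(T.chvStrings, ' ') AS W
WHERE W.value <> ''            -- ignore empty tokens from repeated spaces
GROUP BY W.value
ORDER BY COUNT(*) DESC, W.value;
```

Note that this treats "cat" and "cats" as different words; grouping such variants would need extra logic that is not shown here.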

Answer

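I added two extra filters: the character immediately before the substring (position N.Number - 1) must not be alphanumeric [a-z0-9], and likewise the character immediately after the substring must not be [a-z0-9].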

That should give you what you need. The modified code is below:

DECLARE  @MinLength INT   = 5  --Minimum Substring Length 
DECLARE  @MaxLength INT   = 50 --Maximum Substring Length 
DECLARE  @Delimeter VARCHAR(5) = ' ' 
DECLARE  @T TABLE 
      (
        ID INT IDENTITY 
       , chvStrings VARCHAR(64) 
      ) 
INSERT INTO @T VALUES 
      ('I like cats'), 
      ('I like dogs'), 
      ('cats are great'), 
      ('look at that cat'), 
      ('I love Mittens the cat'), 
      ('I love Mittens the cat a lot'), 
      ('I love Mittens the cat so much'), 
      ('Dogs are okay, I guess...') 

SELECT TOP 10000 
    SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word, 
    COUNT(M.number) AS [Count] 
    --SUBSTRING(T.chvStrings,M.Number+1,1) 
FROM   
    @T as T 
CROSS APPLY 
    (SELECT N.Number 
    FROM [master]..spt_values as N 
    WHERE N.type = 'P' 
     AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N 
CROSS APPLY 
    (SELECT N.Number 
    FROM [master]..spt_values as N 
    WHERE N.type = 'P' 
     AND N.number BETWEEN @MinLength AND @MaxLength) M 
WHERE  
    N.number <= LEN(t.chvStrings) - M.number + 1 
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% ' 
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%' 
    AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR N.number = 1) 
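    -- Character right after the substring: it sits at N.Number + M.Number (M.Number is a length, not a position) 
    AND SUBSTRING(T.chvStrings, N.Number + M.Number, 1) NOT LIKE '%[a-z0-9]%' 
    -- Character right before the substring; for N.Number = 1 this is an empty string and passes 
    AND SUBSTRING(T.chvStrings, N.Number - 1, 1) NOT LIKE '%[a-z0-9]%' 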
GROUP BY 
    SUBSTRING(T.chvStrings, N.Number, M.Number)      
ORDER BY  
    COUNT(T.chvStrings) DESC, 
    LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC 
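
To see the trailing-boundary idea in isolation, here is a small sketch that tests a handful of candidate substrings of one string. It is only an illustration: the variable `@s`, the derived table `C`, and the columns `StartPos`, `SubLen`, and `EndCheck` are made-up names, and the candidate list is hard-coded instead of coming from `spt_values`. The character that follows a candidate is taken to be at `StartPos + SubLen`.

```sql
-- Illustration only: check a few (start, length) candidates against one string.
DECLARE @s VARCHAR(64) = 'I love Mittens the cat a lot';

SELECT
    C.StartPos,
    C.SubLen,
    SUBSTRING(@s, C.StartPos, C.SubLen) AS Candidate,
    CASE
        -- The next character is a letter or digit, so the candidate stops mid-word.
        WHEN SUBSTRING(@s, C.StartPos + C.SubLen, 1) LIKE '[a-z0-9]'
            THEN 'cut mid-word'
        -- Otherwise the next character is a space, punctuation, or the end of the string.
        ELSE 'ends on a word boundary'
    END AS EndCheck
FROM (VALUES (1, 22), (1, 21), (1, 20), (2, 21)) AS C (StartPos, SubLen);
```

With these candidates, "I love Mittens the cat" and " love Mittens the cat" end on a word boundary, while "I love Mittens the ca" and "I love Mittens the c" are flagged as cut mid-word, which is the distinction the question asks for.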