2016-06-24 34 views
-1

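TSQL: match as many comma-separated tags as possible

A table contains a Title field and a Tags field. The tags are generated from the documents by latent Dirichlet allocation (LDA) and can be, for example, 'fish, oven, time', 'BBQ, beer', or 'meat, BBQ'. The number of tags per record is not fixed.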

Given a set of tags, how do I find the records that match the largest number of those tags, regardless of the order of the tags?

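So if 'BBQ, meat' is given, the best result should be 'meat, BBQ'. If 'BBQ, fish, cream' is given, all three records can be returned (each of them has one matching tag).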


Split the words out into a separate table, one per row, with a FK back to the document. – Paparazzi


How many times does "fish" match in "one fish, two fish, red fish, blue fish"? Does that count as one or five? _(Four, sir!)_ – HABO

Answers

1

Use this function (called as splitstring below) and create the following:

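CREATE FUNCTION dbo.getCountOfMatch (@mainString NVARCHAR(MAX), @searchString NVARCHAR(MAX))
RETURNS INT
AS
BEGIN
    DECLARE @returnCount INT

    -- Count the tags that appear in both lists. Trimming removes the space
    -- that follows each comma, so ' BBQ' still matches 'BBQ'.
    SELECT
        @returnCount = COUNT(1)
    FROM
        splitstring(@mainString) A INNER JOIN
        splitstring(@searchString) B ON LTRIM(RTRIM(A.Name)) = LTRIM(RTRIM(B.Name))

    RETURN @returnCount
END
GO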

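DECLARE @search NVARCHAR(MAX) = 'BBQ, meat' -- The tags to look for.

SELECT TOP 1 -- What you want
    Title,
    Tags
FROM
(
    SELECT
        A.Title,
        A.Tags,
        dbo.getCountOfMatch(A.Tags, @search) CountTags -- The number of matches.
    FROM
        TABLE A -- Placeholder: substitute the real table name.
) B
ORDER BY B.CountTags DESC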
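
The splitter that this answer links to is not reproduced on this page. As a rough stand-in, here is a minimal sketch of a compatible function. Its name, its single parameter, and its Name output column are assumptions inferred from how the answer calls it, and the trimming is an addition rather than confirmed behaviour of the linked function.

```sql
-- Hypothetical stand-in for the linked splitter: splits on commas and
-- returns one trimmed item per row in a column called Name.
CREATE FUNCTION dbo.splitstring (@stringToSplit NVARCHAR(MAX))
RETURNS @returnList TABLE (Name NVARCHAR(500))
AS
BEGIN
    DECLARE @pos INT = CHARINDEX(',', @stringToSplit);

    WHILE @pos > 0
    BEGIN
        -- Emit the trimmed text before the next comma, then drop it.
        INSERT INTO @returnList (Name)
        VALUES (LTRIM(RTRIM(LEFT(@stringToSplit, @pos - 1))));

        SET @stringToSplit = SUBSTRING(@stringToSplit, @pos + 1, LEN(@stringToSplit));
        SET @pos = CHARINDEX(',', @stringToSplit);
    END

    -- Whatever remains after the last comma is the final item.
    INSERT INTO @returnList (Name)
    VALUES (LTRIM(RTRIM(@stringToSplit)));

    RETURN;
END
```

With this sketch in place, `SELECT Name FROM dbo.splitstring('BBQ, meat')` yields the two rows 'BBQ' and 'meat'.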

Revision

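-- Table is a placeholder; substitute the real table name.
DECLARE @searchText NVARCHAR(MAX) = 'BBQ, meat'
DECLARE @query NVARCHAR(MAX) = '
    SELECT
        *
    FROM
        Table
    WHERE '

-- Build one LIKE predicate per search tag. Each tag is trimmed, and any
-- embedded single quote is doubled so the generated statement stays valid.
SELECT
    @query +=
    (
        SELECT
            'Tags like ''%' + REPLACE(LTRIM(RTRIM(A.Name)), '''', '''''') + '%'' AND '
        FROM
            splitstring(@searchText) A
        FOR XML PATH ('')
    )

-- Drop the trailing " AND" and list the shortest tag lists first.
SELECT @query = LEFT(@query, LEN(@query) - 4) + ' ORDER BY LEN(Tags)' -- For exact matching: LEN(Tags) = LEN(@searchText)

EXEC sp_executesql @query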

The generated query looks like this:

SELECT 
     * 
    FROM 
     Table 
    WHERE 
     Tags like '%BBQ%' AND 
     Tags like '%meat%' 
    ORDER BY LEN(Tags) 

Works well, but it is slow even on a small sample (14K rows take 17 seconds). –


The answer has been updated. Can you try again? – NEER
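
One likely cause of the slowness is that a scalar function is evaluated once per row. As an illustration only, here is a set-based sketch that ranks records by the number of distinct matching tags. It assumes SQL Server 2016 or later (compatibility level 130) for the built-in STRING_SPLIT. The @Docs table variable and its rows are hypothetical sample data, and no timing has been measured for it.

```sql
-- Hypothetical sample data standing in for the real table.
DECLARE @Docs TABLE (Title VARCHAR(100), Tags VARCHAR(200));
INSERT INTO @Docs (Title, Tags) VALUES
    ('Doc 1', 'fish, oven, time'),
    ('Doc 2', 'BBQ, beer'),
    ('Doc 3', 'meat, BBQ');

DECLARE @search VARCHAR(200) = 'BBQ, meat';

-- Split every record's tags, join them to the split search tags,
-- and count the distinct tags that each record shares with the search.
SELECT
    d.Title,
    d.Tags,
    COUNT(DISTINCT LTRIM(RTRIM(t.value))) AS MatchCount
FROM @Docs AS d
CROSS APPLY STRING_SPLIT(d.Tags, ',') AS t
JOIN STRING_SPLIT(@search, ',') AS s
    ON LTRIM(RTRIM(s.value)) = LTRIM(RTRIM(t.value))
GROUP BY d.Title, d.Tags
ORDER BY MatchCount DESC;
```

For this sample, 'Doc 3' comes first with two matches and 'Doc 2' follows with one. 'Doc 1' shares no tag with the search, so the inner join leaves it out. Counting distinct values also keeps a repeated tag from being counted more than once.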

0

By combining two UDFs, you can return a hit rate (percentage) for the search.

For example:

Select [dbo].[udf-Str-Match-Rate]('Dog,House,Custom',',','The dog house is red',' ') 

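This returns 0.6666: 2 of the 3 words/phrases were found.

Each string can have its own delimiter.

Only distinct words are tested, to avoid inflated results.

I have also included Soundex matching (optional).

The first UDF is standalone and can be used on its own.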

CREATE FUNCTION [dbo].[udf-Str-Parse] (@String varchar(max),@delimeter varchar(10))
--Usage: Select * from [dbo].[udf-Str-Parse]('Dog,Cat,House,Car',',')
--       Select * from [dbo].[udf-Str-Parse]('John Cappelletti was here',' ')
--       Select * from [dbo].[udf-Str-Parse]('id26,id46|id658,id967','|')

Returns @ReturnTable Table (Key_PS int IDENTITY(1,1) NOT NULL , Key_Value varchar(max))

As

Begin
    Declare @IntPos int,@SubStr varchar(max)
    -- Collapse doubled delimiters so they do not produce empty items
    Set @String = Replace(@String,@delimeter+@delimeter,@delimeter)
    Set @IntPos = CharIndex(@delimeter, @String)
    While @IntPos > 0
        Begin
            Set @SubStr = Substring(@String, 0, @IntPos)
            Insert into @ReturnTable (Key_Value) values (@SubStr)
            -- Remove only the leading item and its delimiter; a Replace() here
            -- would also strip later repeats, such as the 'Dog,' inside 'BigDog,'
            Set @String = Substring(@String, @IntPos + DataLength(@delimeter), DataLength(@String))
            Set @IntPos = CharIndex(@delimeter, @String)
        End
    Insert into @ReturnTable (Key_Value) values (@String)
    Return
End

The second UDF requires the first.

CREATE FUNCTION [dbo].[udf-Str-Match-Rate] (@SearchFor varchar(max),@SearchForDelim varchar(5),@SearchIn varchar(max),@SearchInDelim varchar(5))

-- Syntax : Select [dbo].[udf-Str-Match-Rate]('Dog,House,Custom',',','The dog house is red',' ')

Returns money
AS
    BEGIN

    Declare @RetVal money

    -- Distinct words from each string, plus the number of words searched for
    ;with cteSearchFor as (Select Distinct Key_Value from [dbo].[udf-Str-Parse](@SearchFor ,@SearchForDelim))
         ,cteSearchIn as (Select Distinct Key_Value from [dbo].[udf-Str-Parse](@SearchIn,@SearchInDelim))
         ,cteWordCnt as (Select Words=cast(count(*) as money) From cteSearchFor)
    -- Hit rate = number of matched pairs / number of words searched for
    Select @RetVal = isnull(Count(*)/max(Words),0)
     From cteSearchFor S
     Join cteWordCnt W on 1=1
     Join cteSearchIn C
       on S.Key_Value = C.Key_Value
       or Soundex(S.Key_Value) = Soundex(C.Key_Value) -- optional: remove this line for exact matching only

    Return @RetVal

    END

Soundex may return too many hits. – Paparazzi
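
To make that concern concrete, here is a small illustration with words chosen for this example rather than taken from the thread. Soundex keeps the first letter and encodes the following consonants, so short words with different meanings can share a code.

```sql
-- Different words can collapse to the same Soundex code, so the optional
-- Soundex clause in the match-rate function would treat them as hits.
SELECT
    SOUNDEX('meat') AS MeatCode,  -- M300
    SOUNDEX('mud')  AS MudCode,   -- M300
    SOUNDEX('beer') AS BeerCode,  -- B600
    SOUNDEX('bear') AS BearCode;  -- B600
```

For short tags like these, removing the Soundex comparison and matching on the exact value only is probably the safer choice.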

0

Create a table for the tags and populate it using a string split.

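create table tags (
    title varchar(200) not null, -- column types are assumptions
    tag varchar(100) not null,
    primary key (title, tag)
)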

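select title, count(*) as matches
from tags
where tag in ('BBQ', 'fish', 'cream')
group by title
having count(*) > 1 -- at least two matching tags; remove to include single matches
order by count(*) desc -- best matches first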

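-- Fill the tags table by splitting each record's list
-- (assumes the splitter returns a Name column, as in the first answer)
insert into tags (title, tag)
select distinct t.title, ltrim(rtrim(s.Name))
from [table] t
cross apply dbo.splitstring(t.tags) s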