
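Given a specific word pattern (say, 'balloon'), I want to find the n words before and after it, group by them, and count how often each occurs in the titles in my table. Using T-SQL, how can I find and group the words before and after a given term?

For example, if the dataset is:

  • red balloon sky
  • yellow balloon sky road
  • blue balloon chair

I would like the result to be something like: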

- red balloon | 1 
- yellow balloon | 1 
- blue balloon | 1 
- balloon sky | 2 
- balloon chair | 1 

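I figure the best way to accomplish this would be a regular expression in my sproc. So I added a set of regular-expression functions, including the FindWordsInContext function.

To start: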

WITH Words_CTE (Title) 
AS 
-- Define the CTE query. 
(
    SELECT Title 
    FROM ItemData 
    WHERE Title LIKE '%balloon%' 
) 
-- Define the outer query referencing the CTE name. 
SELECT Title 
FROM Words_CTE 

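So I figured I would start with that, work the FindWordsInContext function into the mix, and then group on the words before/after the given word.

- UPDATE -

Thanks to Adrian Iftode below... but the code doesn't do exactly what I'm looking for.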

declare @table table(Sentence varchar(250)) 

insert into @table(sentence) 
    values ('I have another red balloon in the car.'), 
      ('Here is a new balloon for you.'), 
      ('A red balloon is in the other room.'), 
      ('Is there another balloon for me?') 


select TOP(5) SentencePart, NumberOfWords 
from @table 
cross apply dbo.fnGetPartsFromSentence(Sentence, 'balloon') f 
order by 
    NumberOfWords DESC, 
    case when f.Side = 'R' then 0 
    else 1 end 

Output:

balloon is in the other room.  5 
I have another red balloon   4 
Here is a new balloon    4 
Is there another balloon   3 
balloon in the car.     3 

I want to be able to set the range on either side of 'balloon'. In this case, say one word, the output should be:

red balloon  2 
new balloon  1 
another balloon 1 
balloon in  1 
balloon for  2 
balloon is  1 
+1

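Don't use a CTE - use [Full-Text Search](http://msdn.microsoft.com/en-us/library/ms142571.aspx) – 2012-04-03 01:29:25

+0

Does this need to be done in pure SQL? – cctan 2012-04-03 01:29:53

+0

Preferably, for speed, yes, in SQL. Using NEAR or CONTAINS is fine if I already know the term I'm looking for near the word 'balloon'. I want to get a count (and group by) of the one, two, and three words before/after 'balloon'. – ElHaix 2012-04-03 02:11:26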

Answer

0

It's a fair amount of code, so I'll try to explain.

First I used a split function that splits a varchar by a given varchar:

CREATE FUNCTION [dbo].[fnSplitString](@str NVARCHAR(MAX),@sep NVARCHAR(MAX)) 
RETURNS TABLE 
AS 
RETURN 
    WITH a AS(
     SELECT CAST(0 AS BIGINT) AS idx1, 
       CHARINDEX(@sep,@str) idx2, 
       1 as [Level] 
     UNION ALL 
     SELECT idx2 + coalesce(nullif(LEN(@sep),0),1), 
       CHARINDEX(@sep,@str, idx2 + 1), 
       [Level] + 1 as [Level] 
     FROM a 
     WHERE idx2 > 0 
    ) 
    SELECT SUBSTRING(@str,idx1,COALESCE(NULLIF(idx2,0),LEN(@str)+1)-idx1) AS Value, 
      [Level], 
      case when idx1 = 0 then 'R' when idx2 != 0 then 'LR' else 'L' end as Side 
    FROM a 
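
The `coalesce(nullif(LEN(@sep),0),1)` expression in the function above is presumably there because `LEN` ignores trailing spaces, so a single-space separator reports a length of 0. A quick check, independent of the functions in this answer:

```sql
-- LEN ignores trailing spaces; DATALENGTH counts bytes.
select LEN(' ')        as LenOfSpace,     -- 0
       DATALENGTH(' ') as BytesOfSpace;   -- 1 for a varchar literal
```

Without the fallback to 1, each piece after the first would start on the space itself rather than just after it.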

Given the varchar 'red balloon sky' and the space character as the separator, it outputs:

select * 
from dbo.fnSplitString('red balloon sky', ' ') 

Value Level Side 
red  1  R 
balloon 2  LR 
sky  3  L 

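The Side column means: R, the separator is to the right of the word; L, the separator is to the left of the word; LR, the word is surrounded by separators.

When the separator is 'balloon':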

select * 
from dbo.fnSplitString('red balloon sky', 'balloon') 

red  1 R 
sky 2 L 

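So 'balloon' appears to the right of 'red' and to the left of 'sky'.

With this helper function I created another function that outputs the desired format for a single sentence (varchar):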

create FUNCTION [dbo].[fnGetPartsFromSentence](@sentence NVARCHAR(MAX),@word NVARCHAR(MAX)) 
RETURNS TABLE 
AS 
RETURN 


with RawData as 
(select rtrim(ltrim(f.Value)) as LR, 
     (select COUNT (*) from dbo.fnSplitString(rtrim(ltrim(f.Value)), ' ')) as NumberOfWords, 
     f.Side, 
     0 as SideLevel 
from dbo.fnSplitString(@sentence, @word) as f 
where f.Side = 'R' or f.Side = 'L' 
union all 
(
    select rtrim(ltrim(f.Value)) as LR, 
     (select COUNT (*) from dbo.fnSplitString(rtrim(ltrim(f.Value)), ' ')) as NumberOfWords, 
     f.Side, 
     sl.no as SideLevel 
    from dbo.fnSplitString(@sentence, @word) as f 
    join (select 1 as no union all select 2) sl on 1 = 1 
    where f.Side = 'LR' 
) 
) 
select (case when Side = 'R' then LR + ' ' + @word 
      when Side = 'L' then @word + ' ' + LR 
      when Side = 'LR' then 
        (
         case when SideLevel = 1 then @word + ' ' + LR 
         when SideLevel = 2 then LR + ' ' + @word 
         end 
        ) 
      end) as SentencePart, 
     (case when Side = 'R' or Side = 'L' then Side 
       else   
        ( case when SideLevel = 1 then 'L' 
         when SideLevel = 2 then 'R' 
         end 
        ) 
      end) as Side, 
     NumberOfWords   
from RawData 

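This function uses the previous one. It first splits the sentence by the word, then counts the words in each split by splitting it again on spaces. When a split has the word on both sides, the split is duplicated (by joining with the values 1 and 2).

The function also concatenates each split with the word, depending on which side of the word the split falls: left, right, or both. It also outputs Side, this time only as L or R.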

select * 
from [dbo].[fnGetPartsFromSentence]('yellow balloon sky road','balloon') 

SentencePart  Side NumberOfWords 

yellow balloon  R   1 
balloon sky road L   2 
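
To see the duplication described above, a sentence with two occurrences of the word can be passed in. This is a sketch that assumes both functions from this answer have already been created; the input sentence is made up for illustration.

```sql
-- 'blue' sits between two occurrences of 'balloon', so fnSplitString
-- marks it LR and fnGetPartsFromSentence emits it twice.
select SentencePart, Side, NumberOfWords
from dbo.fnGetPartsFromSentence('red balloon blue balloon sky', 'balloon');

-- Rows returned (order is not guaranteed without ORDER BY):
--   red balloon    R  1
--   blue balloon   R  1
--   balloon blue   L  1
--   balloon sky    L  1
```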

Now I can use this function against a table with CROSS APPLY:

declare @table table(Sentence varchar(250)) 

insert into @table(sentence) 
    values ('red balloon sky'), 
      ('yellow balloon sky road'), 
      ('blue balloon chair') 


select SentencePart, NumberOfWords 
from @table 
cross apply dbo.fnGetPartsFromSentence(Sentence, 'balloon') f 
order by 
    case when f.Side = 'R' then 0 
    else 1 end 

The output is:

red balloon   1 
yellow balloon  1 
blue balloon   1 
balloon chair   1 
balloon sky road  2 
balloon sky   1 

It also works when there are multiple occurrences.
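
The query above lists every fragment but neither limits the range nor aggregates. One possible way to get grouped counts for exactly one word on each side is to split every sentence on spaces and pair each occurrence of the word with its neighbours by `Level`. This is only a sketch: the `Id` column, the `@word` variable, and the CTE names are introduced here for illustration, it relies on `fnSplitString` from this answer, and it matches whole space-delimited tokens, so `balloon.` with trailing punctuation would not match.

```sql
declare @word nvarchar(50) = N'balloon';
declare @table table(Id int identity(1,1) primary key, Sentence varchar(250));

insert into @table(Sentence)
values ('I have another red balloon in the car.'),
       ('Here is a new balloon for you.'),
       ('A red balloon is in the other room.'),
       ('Is there another balloon for me?');

with Words as (
    -- One row per word, with its position (Level) in the sentence.
    select t.Id, s.Value, s.[Level]
    from @table t
    cross apply dbo.fnSplitString(t.Sentence, ' ') s
),
Pairs as (
    -- The word immediately before the target word.
    select p.Value + ' ' + w.Value as SentencePart
    from Words w
    join Words p on p.Id = w.Id and p.[Level] = w.[Level] - 1
    where w.Value = @word
    union all
    -- The word immediately after the target word.
    select w.Value + ' ' + n.Value
    from Words w
    join Words n on n.Id = w.Id and n.[Level] = w.[Level] + 1
    where w.Value = @word
)
select SentencePart, count(*) as Occurrences
from Pairs
group by SentencePart
order by Occurrences desc, SentencePart;
```

Widening the range to two or three words would mean joining on a range of `Level` values and concatenating the neighbours in order. Because `fnSplitString` is a recursive CTE, sentences with more than 100 words would hit the default recursion limit.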

+0

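Not bad, but not quite there. It shows the number of words around the target word. 'balloon sky' should = 2. See my update above. – ElHaix 2012-04-05 18:01:24

+0

The last 'balloon sky' comes from the sentence 'red balloon sky', and 'balloon sky road' comes from 'yellow balloon sky road' – 2012-04-05 18:08:26

+0

But I get your point – 2012-04-05 18:10:37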
