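SQL Server full-text search for multilevel list strings
2013-07-08

Is it possible to use full-text search in SQL Server to find strings such as 1.1.1 or 1.5.2 (multilevel paragraph numbers)? My SQL looks like this: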

contains (MyTable.MyColumn,'"*5.1.1*"') 

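I have already tried removing the numbers from the stoplist and disabling the stoplist entirely. With that change, strings such as 5.1 or 1.1 work fine (perhaps they are handled internally as numbers?), but numbers with two dots still return no results.

Is there a way to escape these dotted strings/numbers, or is there any other solution?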

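Answer

Periods are a problem in full-text search because they are usually treated as a full stop between words. The solution is to replace the periods with a different character, which can be done with minimal changes to the application. This is a fairly long script that walks you through identifying the problem and arriving at a solution. If all you want is the fix, you can skip to the "Short answer" section.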

Set up the full-text schema

SET ANSI_NULLS ON 
SET QUOTED_IDENTIFIER ON 
SET ANSI_PADDING ON 

CREATE TABLE [dbo].[FT_Test](
    [id] [int] IDENTITY(1,1) NOT NULL, 
    [TextData] [varchar](max) NOT NULL, 
CONSTRAINT [PK_FT_Test] PRIMARY KEY CLUSTERED 
(
    [id] ASC 
) ON [PRIMARY] 
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY] 

GO 
CREATE FULLTEXT CATALOG [ft_default] WITH ACCENT_SENSITIVITY = ON 
CREATE FULLTEXT INDEX ON [dbo].[FT_Test] KEY INDEX [PK_FT_Test] ON ([ft_default]) 
    WITH (CHANGE_TRACKING AUTO) 
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ADD ([TextData]) 
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ENABLE 

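Verify the SQL Server version

This script was designed around SQL Server 2012, but it should work on 2008 as well. The word breakers changed significantly between SQL Server 2008 and SQL Server 2012 (at least for language ID 1033, US English). The main implication is that 1-2-3 is broken up into 1, 2, 3, 1-2-3, NN1, NN2, NN3 (the inclusion of 1-2-3 is new).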

go 
PRINT 'Word breaker version 14.0.4763.1000 corresponds to SQL Server 2012' 
EXEC master.sys.sp_help_fulltext_system_components @component_type = 'wordbreaker', @param=1033 

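SQL Server parses keywords semi-intelligently

Unfortunately, this works against us here. The same data is stored several times, which bloats the index and leads to incorrect search results.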

go 
DELETE FROM ft_test 
INSERT INTO dbo.FT_Test (TextData) 
VALUES 
( '1.1.1 5.2.1, 7.1.1.34.69; 12.11.10.9.8 4.6 7/13/2013 15,456.345') 

WAITFOR DELAY '00:00:05' 
--Wait 5 seconds for ft index to populate 

SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count 
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content 
     INNER JOIN dbo.FT_Test ON document_id = id 
ORDER BY id, keyword 
--Notice what is returned,the two digit numbers are identified, but the 1 digit numbers aren't (due to default stoplist). 
--Also, note that they are treated as distinct items and are broken up. 4.6 does show up because it is a decimal number. 
--the nn* display_terms are standardized numeric (also, note how the date got standardized as dd20130713 in addition to 7/13/2013) 

SELECT * 
FROM ft_test 
WHERE CONTAINS (*, '"5.2*"') -- No results, 5 and 2 are in default stopword list. 

SELECT * 
FROM ft_test 
WHERE CONTAINS (*, '"12.11*"') -- periods are hard breaks, so this doesn't work either 
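The fixed five-second delay above can be too short on a busy server and needlessly long on an idle one. As a rough sketch, the wait could instead poll the index status. This assumes the dbo.FT_Test table from the setup section; the 60-attempt cap is an arbitrary safety limit and is not part of the original script. With automatic change tracking, the status may still read idle for a moment right after an insert, which is why pending changes are checked as well.

```sql
-- Sketch: wait until the full-text index on dbo.FT_Test has caught up,
-- instead of sleeping for a fixed 5 seconds.
-- TableFulltextPopulateStatus = 0 means the index is idle;
-- TableFulltextPendingChanges counts change-tracking entries not yet processed.
DECLARE @attempts int = 0;   -- hypothetical safety counter

WHILE (   OBJECTPROPERTYEX(OBJECT_ID('dbo.FT_Test'), 'TableFulltextPopulateStatus') <> 0
       OR OBJECTPROPERTYEX(OBJECT_ID('dbo.FT_Test'), 'TableFulltextPendingChanges') > 0)
      AND @attempts < 60
BEGIN
    WAITFOR DELAY '00:00:01';   -- check once per second
    SET @attempts += 1;
END
```

If the loop exits because the cap was reached, the index may still be populating, so the keyword queries could return incomplete results.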

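Create a custom stoplist so that single digits are indexed

Single digits are usually worthless for full-text search, but we need them. We will use the default system stoplist as the basis.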

CREATE FULLTEXT STOPLIST [no_numbers] 
FROM SYSTEM STOPLIST 
AUTHORIZATION [dbo]; 
go 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '0' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '1' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '2' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '3' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '4' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '5' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '6' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '7' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '8' LANGUAGE 'English'; 
ALTER FULLTEXT STOPLIST [no_numbers] DROP '9' LANGUAGE 'English'; 
GO 
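Before rebuilding the index, it may be worth confirming that the digits really are gone from the new stoplist. The query below is a minimal sketch that assumes the [no_numbers] stoplist created above and US English (language ID 1033).

```sql
-- Sketch: list any single-digit stopwords still present in [no_numbers].
-- An empty result suggests that all ten DROP statements took effect.
SELECT sw.stopword, sw.language_id
FROM sys.fulltext_stopwords AS sw
     INNER JOIN sys.fulltext_stoplists AS sl ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = 'no_numbers'
  AND sw.language_id = 1033        -- US English
  AND sw.stopword LIKE '[0-9]';    -- exactly one digit
```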

Recreate the index based on the new stoplist

This helps somewhat and gets us closer to where we want our full-text index to be.

DROP FULLTEXT INDEX ON dbo.FT_Test 

CREATE FULLTEXT INDEX ON [dbo].[FT_Test] (TextData) KEY INDEX [PK_FT_Test] ON ([ft_default]) 
    WITH (CHANGE_TRACKING AUTO, STOPLIST = [no_numbers]) 

WAITFOR DELAY '00:00:05' 
--Wait 5 seconds for ft index to populate 

SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count 
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content 
     INNER JOIN dbo.FT_Test ON document_id = id 
ORDER BY id, keyword 

--Progress, now single digits are showing up 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1 1 14.123') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('5.2.1.1.14') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.1.3 ') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('2.2.3.3') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('6.0 88.00.00') 

--This works in the first 3 cases, but doesn't work for 2.2 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '1.1.1*' ) ct ON ct.[key] = ft_test.id 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2.2.3.3*' ) ct ON ct.[key] = ft_test.id 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2.2.3*' ) ct ON ct.[key] = ft_test.id 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2.2*' ) ct ON ct.[key] = ft_test.id 

--Double quoting makes it match more stuff, but still is broken. 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"1.1.1*"' ) ct ON ct.[key] = ft_test.id 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"1.1*"' ) ct ON ct.[key] = ft_test.id 

We are definitely getting close now, but the 2.2* example above is annoying. It is being parsed as a decimal number:

declare @stoplistId INT 
SET @stoplistid = (SELECT stoplist_id FROM sys.fulltext_stoplists WHERE name ='no_numbers') 
SELECT * FROM sys.dm_fts_parser('"1.1.1*"', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('1.1.1*', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('"1.1*"', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('1.1*', 1033,@stoplistId, 0) 

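What other characters are potential separators?

Let's try a few and see whether any stand out. We could try something like "XXXDOTXXX", but it would be cleaner to keep the separator to a single character if possible.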

INSERT INTO dbo.FT_Test (TextData) 
VALUES 
( '1-1-1 2@2@2 3#3#3 4$4$4 5%5%5 6^6^6 7&7&7 8*8*8 9=9=9 10_10_10 11|11|11 12:12:12 12:12:12:12 13"13"13" 14~14~14 15`15`15') 

SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count 
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content 
     INNER JOIN dbo.FT_Test ON document_id = id WHERE textdata LIKE '1-1-1%' 
ORDER BY id, keyword 

DELETE FROM ft_test WHERE textdata LIKE '%3#3#3%' 

It looks like a hyphen, an underscore, or a backquote could work. Let's examine these in more detail.

INSERT INTO dbo.FT_Test (TextData) 
VALUES ('3`3`3`4 1`2`3 6`1`2`3`4 ') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('5-5-5-6 2-3-4 6-1-2-3-4-5') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('6_6_6_7 3_4_5 7_1_2_3_4_5_6') 


SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count 
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content 
     INNER JOIN dbo.FT_Test ON document_id = id WHERE textdata LIKE '3`3%' OR TextData LIKE '5-5%' OR textdata LIKE '6_6%' 
ORDER BY id, keyword 
--Hyphen isn't looking good now, it gets stored 3 times, as numbers, as individual digits and as a full string. 

--Let's try backquote: 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"3`3*"' ) ct ON ct.[key] = ft_test.id 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"1`2`3*"' ) ct ON ct.[key] = ft_test.id 

-- these match anything with a single 6... not good... 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6`*"' ) ct ON ct.[key] = ft_test.id 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6`*' ) ct ON ct.[key] = ft_test.id 

GO 
--the backquote is getting dropped when it's parsed 
--(the GO above starts a new batch so @stoplistId can be declared again) 
DECLARE @stoplistId INT 
SET @stoplistId = (SELECT stoplist_id FROM sys.fulltext_stoplists WHERE name ='no_numbers') 
SELECT * FROM sys.dm_fts_parser('"6`*"', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('6`*', 1033,@stoplistId, 0) 


GO 
--Underscore is just about all we have left. 
DECLARE @stoplistId INT 
SET @stoplistId = (SELECT stoplist_id FROM sys.fulltext_stoplists WHERE name ='no_numbers') 
SELECT * FROM sys.dm_fts_parser('"2_*"', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('2_*', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('"2_2*"', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('2_2*', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('"2_2_*"', 1033,@stoplistId, 0) 
SELECT * FROM sys.dm_fts_parser('2_2_*', 1033,@stoplistId, 0) 


INSERT INTO dbo.FT_Test (TextData) 
VALUES ('6_6_66_7 77_6_6_6') 


-- 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_*"' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_*' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_6*' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2_3*' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_6*' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 

SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_6_6_7*' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_6_7*"' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%' 

SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_6_*"' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_6*"' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_*"' ) ct ON ct.[key] = ft_test.id  WHERE textdata LIKE '%[_]%' 
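To compare how the word breaker tokenizes the same paragraph number with each candidate separator, the parser output can be viewed side by side. This is only a sketch: the literal 6.6.6.7 is an arbitrary sample value, and NULL is passed as the stoplist so that no stopwords are filtered. As far as I know, sys.dm_fts_parser requires membership in the sysadmin role.

```sql
-- Sketch: tokenize one paragraph number with three different separators.
-- Fewer rows and a single display_term per input suggest the separator
-- keeps the paragraph number together as one token.
SELECT 'period' AS separator, display_term, special_term, occurrence
FROM sys.dm_fts_parser('"6.6.6.7"', 1033, NULL, 0)
UNION ALL
SELECT 'hyphen', display_term, special_term, occurrence
FROM sys.dm_fts_parser('"6-6-6-7"', 1033, NULL, 0)
UNION ALL
SELECT 'underscore', display_term, special_term, occurrence
FROM sys.dm_fts_parser('"6_6_6_7"', 1033, NULL, 0);
```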

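Short answer

Use underscores

Replacing periods with underscores is the way to go. An underscore is treated as a character rather than as punctuation. SQL Server can create a full-text index on a computed column. This lets us use a formula to "fix" the data, index it, and query it without extra storage (and with minimal overhead). You will need to modify your application to query for "1_2_3" instead of "1.2.3".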

--naive implementation 
ALTER TABLE ft_test ADD [TextData_FT1] AS ([textdata]+' '+replace([TextData],'.','_')) 

--strip all letters, leaving digits and separators. You can customize this to pull out only the paragraph numbers 
ALTER TABLE ft_test ADD [TextData_FT2] AS (REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UPPER([TextData]) 
,'A', ' '),'B', ' '),'C', ' '),'D', ' '),'E', ' '),'F', ' '), 
'G', ' '),'H', ' '),'I', ' '),'J', ' '),'K', ' '),'L', ' '),'M', ' '),'N', ' '), 
'O', ' '),'P', ' '),'Q', ' '),'R', ' '),'S', ' '),'T', ' '),'U', ' '),'V', ' '), 
'W', ' '),'X', ' '),'Y', ' '),'Z', ' '), '.','_') , '  ',' ') 
) 


--Add computed columns to FT index 
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ADD ([TextData_FT1]) 
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ADD ([TextData_FT2]) 


DELETE FROM dbo.FT_Test 

INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1 This is the chapter title') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.1 Section heading') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.1.1 paragraph 1 is very interesting') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.1.2 paragraph two is better') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.2 Another Section') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.2.1 Foobar qwerty loren ipsum') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.2.2 Foobar2 qwerty2 loren ipsum 12 items ') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('1.2.12 Foobar2 qwerty2 loren ipsum ') 
INSERT INTO dbo.FT_Test (TextData) 
VALUES ('2.2.17 sql server is great. ') 

--naive implementation 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '"1_1*"') 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '1*') 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '2*') -- 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '"1_1_2*"') 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '1_1_2*') 

--only index the paragraph identifiers 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '"1_1*"') 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '1*') 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '2*') -- 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '"1_1_2*"') 
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '1_1_2*') 
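The application-side change can stay small: take the paragraph number as the user typed it and turn it into an underscore-based prefix term before it reaches CONTAINS. The snippet below is a minimal sketch of that idea. @paragraph and @term are hypothetical names, and it assumes the TextData_FT1 computed column has been added to the full-text index as shown above.

```sql
-- Sketch: build the CONTAINS prefix term from a dotted paragraph number.
DECLARE @paragraph nvarchar(50) = N'1.1';   -- hypothetical user input

-- Remove any double quotes from the input, swap periods for underscores,
-- then wrap the result as a quoted prefix term. For '1.1' this gives "1_1*".
DECLARE @term nvarchar(60) =
    N'"' + REPLACE(REPLACE(@paragraph, N'"', N''), N'.', N'_') + N'*"';

SELECT id, TextData
FROM dbo.FT_Test
WHERE CONTAINS(TextData_FT1, @term);
```

Passing the term as a variable keeps the search text out of the SQL string itself, so callers can supply it as an ordinary parameter.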

Thanks. But that does look a bit expensive. Is there really no way to tell the full-text engine to treat the dot differently? – user250773

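Short of writing your own parser (which you do not want to do, trust me), no. This approach doesn't really add much overhead, because the formula that replaces the characters is evaluated only when a row is indexed, not when it is searched. The parsers are part of the Windows search functionality and are tuned for e-mail messages, documents, and the like. For domain-specific indexing, they have shortcomings. – StrayCatDBA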