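SQL Server full-text search for multilevel list strings
Is it possible to use full-text search in SQL Server to find strings like 1.1.1 or 1.5.2 (multilevel paragraph numbers)? My SQL looks like this:
contains (MyTable.MyColumn,'"*5.1.1*"')
I have already tried removing the numbers from the stopword list, and also disabling the stoplist entirely. With that, strings like 5.1 or 1.1 work fine (perhaps they are handled internally as numbers?), but numbers with two dots still return no results.
Is there a way to escape these dotted strings/numbers, or is there any other solution?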
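Periods are a problem in full-text search because they are usually treated as a full stop between words. The solution is to replace the period with a different character, which can be done with minimal changes to the application. What follows is a fairly long script that walks you through identifying the problem and arriving at the solution. If all you want is the workaround, you can skip down to the "short answer" version.
Set up the full-text schema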
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
SET ANSI_PADDING ON
CREATE TABLE [dbo].[FT_Test](
[id] [int] IDENTITY(1,1) NOT NULL,
[TextData] [varchar](max) NOT NULL,
CONSTRAINT [PK_FT_Test] PRIMARY KEY CLUSTERED
(
[id] ASC
) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
CREATE FULLTEXT CATALOG [ft_default] WITH ACCENT_SENSITIVITY = ON
CREATE FULLTEXT INDEX ON [dbo].[FT_Test] KEY INDEX [PK_FT_Test] ON ([ft_default])
WITH (CHANGE_TRACKING AUTO)
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ADD ([TextData])
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ENABLE
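Verify the SQL Server version
This script was designed around SQL Server 2012, but it should work on 2008 as well. The word breakers changed significantly between SQL Server 2008 and SQL Server 2012 (at least for language ID 1033, US English). The main implication is that 1-2-3 is broken into 1, 2, 3, 1-2-3, nn1, nn2, nn3 (the inclusion of 1-2-3 is new).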
go
PRINT 'Word breaker version 14.0.4763.1000 indicates Sql Server 2012'
EXEC master.sys.sp_help_fulltext_system_components @component_type = 'wordbreaker', @param=1033
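SQL Server parses keywords semi-intelligently
Unfortunately, this works against us here. The same data is stored multiple times, which bloats the index and produces false search results.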
go
DELETE FROM ft_test
INSERT INTO dbo.FT_Test (TextData)
VALUES
( '1.1.1 5.2.1, 7.1.1.34.69; 12.11.10.9.8 4.6 7/13/2013 15,456.345')
WAITFOR DELAY '00:00:05'
--Wait 5 seconds for ft index to populate
SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content
INNER JOIN dbo.FT_Test ON document_id = id
ORDER BY id, keyword
--Notice what is returned: the two-digit numbers are identified, but the one-digit numbers aren't (due to the default stoplist).
--Also, note that they are treated as distinct items and are broken up. 4.6 does show up because it is a decimal number.
--The nn* display_terms are standardized numerics (also, note how the date got standardized as dd20130713 in addition to 7/13/2013).
SELECT *
FROM ft_test
WHERE CONTAINS (*, '"5.2*"') -- No results, 5 and 2 are in default stopword list.
SELECT *
FROM ft_test
WHERE CONTAINS (*, '"12.11*"') -- periods are hard breaks, so this doesn't work either
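Create a custom stoplist so that single digits are indexed
Single digits are usually worthless for full-text search, but we need them. We'll use the default system stoplist as the base.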
CREATE FULLTEXT STOPLIST [no_numbers]
FROM SYSTEM STOPLIST
AUTHORIZATION [dbo];
go
ALTER FULLTEXT STOPLIST [no_numbers] DROP '0' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '1' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '2' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '3' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '4' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '5' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '6' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '7' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '8' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [no_numbers] DROP '9' LANGUAGE 'English';
GO
Rebuild the index based on the new stoplist
This helps somewhat and gets us closer to where we want our full-text index to be.
DROP FULLTEXT INDEX ON dbo.FT_Test
CREATE FULLTEXT INDEX ON [dbo].[FT_Test] (TextData) KEY INDEX [PK_FT_Test] ON ([ft_default])
WITH (CHANGE_TRACKING AUTO, STOPLIST = [no_numbers])
WAITFOR DELAY '00:00:05'
--Wait 5 seconds for ft index to populate
SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content
INNER JOIN dbo.FT_Test ON document_id = id
ORDER BY id, keyword
--Progress, now single digits are showing up
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1 1 14.123')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('5.2.1.1.14')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.1.3 ')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('2.2.3.3')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('6.0 88.00.00')
--This works in the first 3 cases, but doesn't work for 2.2
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '1.1.1*' ) ct ON ct.[key] = ft_test.id
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2.2.3.3*' ) ct ON ct.[key] = ft_test.id
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2.2.3*' ) ct ON ct.[key] = ft_test.id
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2.2*' ) ct ON ct.[key] = ft_test.id
--Double quoting makes it match more stuff, but still is broken.
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"1.1.1*"' ) ct ON ct.[key] = ft_test.id
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"1.1*"' ) ct ON ct.[key] = ft_test.id
We're definitely getting closer now, but the 2.2* example above is annoying. It is parsed as a decimal number:
declare @stoplistId INT
SET @stoplistid = (SELECT stoplist_id FROM sys.fulltext_stoplists WHERE name ='no_numbers')
SELECT * FROM sys.dm_fts_parser('"1.1.1*"', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('1.1.1*', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('"1.1*"', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('1.1*', 1033,@stoplistId, 0)
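What other characters could serve as separators?
Let's try a few and see whether any stand out. We could use something like "XXXDOTXXX", but keeping it to a single character would be cleaner if possible.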
INSERT INTO dbo.FT_Test (TextData)
VALUES
( '1-1-1 2@2@2 3#3#3 4$4$4 5%5%5 6^6^6 7&7&7 8*8*8 9=9=9 10_10_10 11|11|11 12:12:12 12:12:12:12 13"13"13" 14~14~14 15`15`15')
SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content
INNER JOIN dbo.FT_Test ON document_id = id WHERE textdata LIKE '1-1-1%'
ORDER BY id, keyword
DELETE FROM ft_test WHERE textdata LIKE '%3#3#3%'
It looks like the hyphen, underscore, or backquote could work. Let's examine these in more detail.
INSERT INTO dbo.FT_Test (TextData)
VALUES ('3`3`3`4 1`2`3 6`1`2`3`4 ')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('5-5-5-6 2-3-4 6-1-2-3-4-5')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('6_6_6_7 3_4_5 7_1_2_3_4_5_6')
SELECT ft_test.*, ft_content.display_term, ft_content.occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('ft_test')) ft_content
INNER JOIN dbo.FT_Test ON document_id = id WHERE textdata LIKE '3`3%' OR TextData LIKE '5-5%' OR textdata LIKE '6_6%'
ORDER BY id, keyword
--Hyphen isn't looking good now, it gets stored 3 times, as numbers, as individual digits and as a full string.
--Let's try backquote:
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"3`3*"' ) ct ON ct.[key] = ft_test.id
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"1`2`3*"' ) ct ON ct.[key] = ft_test.id
-- these match anything with a single 6... not good...
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6`*"' ) ct ON ct.[key] = ft_test.id
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6`*' ) ct ON ct.[key] = ft_test.id
--the backquote is getting dropped when it's parsed
declare @stoplistId INT
SET @stoplistid = (SELECT stoplist_id FROM sys.fulltext_stoplists WHERE name ='no_numbers')
SELECT * FROM sys.dm_fts_parser('"6`*"', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('6`*', 1033,@stoplistId, 0)
--Underscore is just about all we have left.
declare @stoplistId INT
SET @stoplistid = (SELECT stoplist_id FROM sys.fulltext_stoplists WHERE name ='no_numbers')
SELECT * FROM sys.dm_fts_parser('"2_*"', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('2_*', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('"2_2*"', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('2_2*', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('"2_2_*"', 1033,@stoplistId, 0)
SELECT * FROM sys.dm_fts_parser('2_2_*', 1033,@stoplistId, 0)
INSERT INTO dbo.FT_Test (TextData)
VALUES ('6_6_66_7 77_6_6_6')
--
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_*"' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_*' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_6*' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '2_3*' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_6*' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '6_6_6_7*' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_6_7*"' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_6_*"' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_6*"' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
SELECT * FROM ft_test LEFT JOIN CONTAINSTABLE(ft_test, *, '"6_6_*"' ) ct ON ct.[key] = ft_test.id WHERE textdata LIKE '%[_]%'
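Use underscores
Replacing periods with underscores is the way to go. The underscore is treated as a word character rather than as punctuation. SQL Server can create a full-text index on a computed column. This lets us use a formula to "fix" the data, index it, and query it with no extra storage (and with minimal overhead). You will need to modify the application to query for "1_2_3" instead of "1.2.3".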
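The application-side change can be small. Below is a minimal sketch in Python, assuming the user types a plain dotted paragraph number; the function name `to_fulltext_term` and its `prefix` flag are made up for illustration and are not part of any library. It validates the input and produces the quoted search condition string used in the CONTAINS queries in this answer.

```python
import re

# Hypothetical helper: accepts only dotted paragraph numbers such as 1, 1.2 or 5.1.1
_PARAGRAPH_NUMBER = re.compile(r"\d+(?:\.\d+)*")


def to_fulltext_term(paragraph_number: str, prefix: bool = True) -> str:
    """Convert '1.2.3' into a CONTAINS search condition such as '"1_2_3*"'."""
    number = paragraph_number.strip()
    if not _PARAGRAPH_NUMBER.fullmatch(number):
        raise ValueError(f"not a dotted paragraph number: {paragraph_number!r}")
    # Periods become underscores so the term matches the underscore-based column
    term = number.replace(".", "_")
    # Double quotes are kept because the prefix searches in this answer use them
    return f'"{term}*"' if prefix else f'"{term}"'


print(to_fulltext_term("1.1.2"))  # "1_1_2*"
```

The returned string would normally be passed to the query as a parameter rather than concatenated into the SQL text. Rejecting anything that is not digits and periods also keeps stray quotes out of the search condition.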
--naive implementation
ALTER TABLE ft_test ADD [TextData_FT1] AS ([textdata]+' '+replace([TextData],'.','_'))
--strip all letters. You can customize this to pull out only the paragraph numbers
ALTER TABLE ft_test ADD [TextData_FT2] AS (REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UPPER([TextData])
,'A', ' '),'B', ' '),'C', ' '),'D', ' '),'E', ' '),'F', ' '),
'G', ' '),'H', ' '),'I', ' '),'J', ' '),'K', ' '),'L', ' '),'M', ' '),'N', ' '),
'O', ' '),'P', ' '),'Q', ' '),'R', ' '),'S', ' '),'T', ' '),'U', ' '),'V', ' '),
'W', ' '),'X', ' '),'Y', ' '),'Z', ' '), '.','_') , '  ',' ')
)
--Add computed columns to FT index
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ADD ([TextData_FT1])
ALTER FULLTEXT INDEX ON [dbo].[FT_Test] ADD ([TextData_FT2])
DELETE FROM dbo.FT_Test
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1 This is the chapter title')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.1 Section heading')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.1.1 paragraph 1 is very interesting')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.1.2 paragraph two is better')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.2 Another Section')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.2.1 Foobar qwerty loren ipsum')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.2.2 Foobar2 qwerty2 loren ipsum 12 items ')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('1.2.12 Foobar2 qwerty2 loren ipsum ')
INSERT INTO dbo.FT_Test (TextData)
VALUES ('2.2.17 sql server is great. ')
--naive implementation
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '"1_1*"')
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '1*')
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '2*') --
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '"1_1_2*"')
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft1, '1_1_2*')
--only index the paragraph identifiers
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '"1_1*"')
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '1*')
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '2*') --
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '"1_1_2*"')
SELECT * FROM ft_Test WHERE CONTAINS(TextData_ft2, '1_1_2*')
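To get a feel for what text the two computed columns hand to the full-text indexer, the expressions can be mimicked outside the database. This is only a rough Python emulation, assuming ASCII input and assuming the last REPLACE in TextData_FT2 is meant to collapse double spaces into one; the function names are invented for this sketch, and the word breaking itself is not modeled.

```python
import string


def text_data_ft1(text_data: str) -> str:
    """Mimic TextData_FT1: original text plus a copy with '.' replaced by '_'."""
    return text_data + " " + text_data.replace(".", "_")


def text_data_ft2(text_data: str) -> str:
    """Mimic TextData_FT2: blank out letters, then turn '.' into '_'."""
    result = text_data.upper()
    for letter in string.ascii_uppercase:
        result = result.replace(letter, " ")
    result = result.replace(".", "_")
    # Assumption: a single pass that collapses double spaces, like one REPLACE call
    return result.replace("  ", " ")


for row in ("1.1 Section heading", "1.2.12 Foobar2 qwerty2 loren ipsum "):
    print(repr(text_data_ft1(row)))
    print(text_data_ft2(row).split())
```

Note that digits embedded in words (the 2 in "Foobar2") survive in the second column as standalone tokens, which is one reason you may want to customize the expression to keep only the paragraph numbers.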
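Thanks. But it does look a bit expensive. Is there really no way to tell the full-text engine to treat the dot differently? – user250773
Short of writing your own parser (which you don't want to do, trust me), no. This approach doesn't really add that much overhead, because the formula that replaces the characters is evaluated only when the row is indexed, not when it is searched. The parsers are part of the Windows search functionality and are designed for email messages, documents, and so on. For domain-specific indexing, they have shortcomings. – StrayCatDBA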