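Agreed, a CLR solution should be faster. Beyond that, I don't think SQL Server should be doing this job at all. You could write a client application (VB.NET, C#, etc.) or a PowerShell script to handle it.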
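As a rough illustration of the client-side route, here is a minimal sketch in Python. Python is my substitution (the answer names VB.NET, C#, and PowerShell), and `AnchorCollector` and `extract_links` are hypothetical names. It uses the standard library's `html.parser` rather than substring searches, and it assumes the text has already been fetched from the database.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element it sees."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._in_anchor = False  # True while an <a> element is open
        self._href = None        # href of the currently open anchor
        self._text = []          # text fragments inside the open anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text)))
            self._in_anchor = False


def extract_links(html):
    """Hypothetical helper: parse one HTML fragment and return its links."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = ('Here you can visit <a href="http://www.thisite.com">this link</a> '
          'or this <a href="http://www.newsite.com">new link</a>')
print(extract_links(sample))
# [('http://www.thisite.com', 'this link'), ('http://www.newsite.com', 'new link')]
```

A real parser also skips anchors that sit inside HTML comments, which a plain substring scan would not.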
If you want a T-SQL-only solution (please read the paragraph above once more), take a look at this query (it requires at least SQL Server 2005):
CREATE TABLE dbo.TestData
(
ID INT IDENTITY(1,1) PRIMARY KEY
,SomeText NVARCHAR(MAX) NOT NULL
);
-- Two sample rows; ID is generated by the IDENTITY column
INSERT INTO dbo.TestData (SomeText)
SELECT 'Here you can visit <a href="http://www.thisite.com">this link</a> or this <a href="http://www.newsite.com">new link</a>'
UNION ALL
SELECT '<div class="tagged">
<a href="https://stackoverflow.com/questions/tagged/string" class="post-tag">string</a>
<span class="item-multiplier">× 16364</span><br>
<a href="https://stackoverflow.com/questions/tagged/tsql" class="post-tag">tsql</a>
<span class="item-multiplier">× 10304</span><br>
<a href="https://stackoverflow.com/questions/tagged/substring" class="post-tag">substring</a><acronym title="as soon as possible">ASAP</acronym>';
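-- Recursive CTE: the anchor member extracts the first <a ...>...</a> of each row,
-- the recursive member resumes the search just after the previous </a> (LastIndex)
WITH ParseAnchorTags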
AS
(
SELECT a.ID
,SUBSTRING(a.SomeText, CHARINDEX('<a ',a.SomeText), CHARINDEX('</a>',a.SomeText)-CHARINDEX('<a ',a.SomeText)+4) AS Txt
,CHARINDEX('</a>',a.SomeText)+3 AS LastIndex
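FROM dbo.TestData a
WHERE CHARINDEX('<a ',a.SomeText) > 0 -- skip rows without any anchor tag; otherwise SUBSTRING returns garbage for them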
UNION ALL
SELECT a.ID
,SUBSTRING(a.SomeText, CHARINDEX('<a ',a.SomeText,prev.LastIndex+1), CHARINDEX('</a>',a.SomeText,prev.LastIndex+1)-CHARINDEX('<a ',a.SomeText,prev.LastIndex+1)+4) AS Txt
,CHARINDEX('</a>',a.SomeText,prev.LastIndex+1)+3 AS LastIndex
FROM dbo.TestData a
INNER JOIN ParseAnchorTags prev ON a.ID=prev.ID
AND CHARINDEX('<a ',a.SomeText,prev.LastIndex+1) > 0
)
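SELECT cte.ID, cte.Txt
FROM ParseAnchorTags cte
ORDER BY cte.ID, cte.LastIndex;
-- Recursion stops at the default limit of 100 levels; add OPTION (MAXRECURSION 0)
-- after the ORDER BY clause if a single row can hold more anchors than that.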
DROP TABLE dbo.TestData;
Results:
ID Txt
----------- --------------------------------------------------------------------
1 <a href="http://www.thisite.com">this link</a>
1 <a href="http://www.newsite.com">new link</a>
2 <a href="https://stackoverflow.com/questions/tagged/string" class="post-tag">string</a>
2 <a href="https://stackoverflow.com/questions/tagged/tsql" class="post-tag">tsql</a>
2 <a href="https://stackoverflow.com/questions/tagged/substring" class="post-tag">substring</a>
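To make the recursion easier to follow, the same left-to-right scan can be written as an ordinary loop. This is only a sketch in Python, not part of the answer's T-SQL. `anchor_tags` is a hypothetical name, and `str.find` stands in for `CHARINDEX` (0-based, returning -1 instead of 0 when nothing is found).

```python
def anchor_tags(text):
    """Return every '<a ...>...</a>' substring, scanning left to right."""
    found = []
    pos = 0  # plays the role of LastIndex + 1 in the CTE
    while True:
        start = text.find("<a ", pos)   # next opening tag
        if start == -1:
            break                       # no more anchors: recursion ends
        end = text.find("</a>", start)  # matching close, searched from the opening tag
        if end == -1:
            break                       # unterminated anchor: stop instead of slicing garbage
        found.append(text[start:end + 4])  # +4 keeps the '</a>' itself
        pos = end + 4
    return found


row1 = ('Here you can visit <a href="http://www.thisite.com">this link</a> '
        'or this <a href="http://www.newsite.com">new link</a>')
for tag in anchor_tags(row1):
    print(tag)
# <a href="http://www.thisite.com">this link</a>
# <a href="http://www.newsite.com">new link</a>
```

Two differences from the query are worth noting: `str.find` is case-sensitive, whereas `CHARINDEX` follows the column's collation (often case-insensitive), and the loop looks for `</a>` starting at the opening tag rather than at `LastIndex+1`.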
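I agree with using CLR, but regular expressions cannot *reliably* parse HTML. For a good read, see [You can't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)... *HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions.*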
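I agree, but the author does not need to **parse HTML**. The problem is simple: "extract everything between the opening href tag and the closing href tag". The KISS principle should work just fine here.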
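Both comments can be seen in a small experiment. The sketch below is in Python, and `naive_anchor_tags` and the two inputs are made up for illustration. It applies the same substring idea to inputs that are slightly less tidy than the sample data: the simple scan still returns something, but not what an HTML-aware reader would expect.

```python
def naive_anchor_tags(text):
    """Substring scan: everything from '<a ' up to the next '</a>'."""
    found, pos = [], 0
    while True:
        start = text.find("<a ", pos)
        if start == -1:
            break
        end = text.find("</a>", start)
        if end == -1:
            break
        found.append(text[start:end + 4])
        pos = end + 4
    return found


# Hypothetical inputs, not taken from the question:
commented = '<!-- <a href="/old">old</a> --> <a href="/new">new</a>'
quoted = '<a href="/x" title="see </a> tag">x</a>'

print(naive_anchor_tags(commented))
# ['<a href="/old">old</a>', '<a href="/new">new</a>']  (commented-out link included)
print(naive_anchor_tags(quoted))
# ['<a href="/x" title="see </a>']  (cut off inside the attribute value)
```

Whether this matters depends on how well-behaved the stored markup is; for input like the sample rows the simple approach is adequate.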