SQL UDF optimization

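I have written the function below. It takes two comma-delimited strings, splits each into its own temp table, and then uses those tables to find the percentage of words the two strings have in common. The problem is that when I call it once per row on a data set of about 200k rows, the query times out! Are there any optimizations that can be done?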

ALTER FUNCTION [GetWordSimilarity](@String varchar(8000), 
@String2 varchar(8000),@Delimiter char(1)) 
returns decimal(16,2)   
as   
begin   
declare @result as decimal (16,2) 
declare @temptable table (items varchar(8000))   
declare @temptable2 table (items varchar(8000)) 
declare @numberOfCommonWords decimal(16,2) 
declare @countTable1 decimal(16,2) 
declare @countTable2 decimal(16,2) 
declare @denominator decimal(16,2) 
set @result = 0.0 --dummy value 
declare @idx int   
declare @slice varchar(8000)   

select @idx = 1   
    if len(@String)<1 or @String is null or len(@String2) = 0 or @String2 is null return 0.0 

--populating @temptable 
while @idx!= 0   
begin   
    set @idx = charindex(@Delimiter,@String)   
    if @idx!=0   
     set @slice = left(@String,@idx - 1) 
    else   
     set @slice = @String 

    if(len(@slice)>0) 
     insert into @temptable(Items) values(ltrim(rtrim(@slice)))   

    set @String = right(@String,len(@String) - @idx)   
    if len(@String) = 0 break   
end  

select @idx = 1 

----populating @temptable2 
while @idx!= 0   
begin   
    set @idx = charindex(@Delimiter,@String2)   
    if @idx!=0   
     set @slice = left(@String2,@idx - 1) 
    else   
     set @slice = @String2 

    if(len(@slice)>0) 
     insert into @temptable2(Items) values(ltrim(rtrim(@slice)))   

    set @String2 = right(@String2,len(@String2) - @idx)   
    if len(@String2) = 0 break   
end  

--calculating percentage of words match 
if (((select COUNT(*) from @temptable) = 0) or ((select COUNT(*) from @temptable2) = 0)) 
    return 0.0 

select @numberOfCommonWords = COUNT(*) from 
(
    select distinct items from @temptable 
    intersect 
    select distinct items from @temptable2 
) a 

select @countTable1 = COUNT (*) from @temptable 
select @countTable2 = COUNT (*) from @temptable2 

if(@countTable1 > @countTable2) set @denominator = @countTable1 
else set @denominator = @countTable2 

set @result = @numberOfCommonWords/@denominator 

return @result 
end

Thanks a lot!

Source

2014-04-03 Kumar Vaibhav

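You can optimize the split by using a numbers table. There is an example on SQLServerCentral that explains how it works. I'll see if I can find it. – Dreamwalker

Thanks! Please let me know if you can find it. Thanks again!

I found the link: http://www.sqlservercentral.com/articles/T-SQL/62867/ Essentially, you build a table of numbers and use it in place of the loop. The article goes into far more depth than I can here. – Dreamwalker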

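There is no way to write a T-SQL UDF with a lot of string manipulation inside that will behave well on a large number of rows. You will get some gain if you use a numbers table, though: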

declare 
    @col_list varchar(1000), 
    @sep char(1) 

set @col_list = 'TransactionID, ProductID, ReferenceOrderID, ReferenceOrderLineID, TransactionDate, TransactionType, Quantity, ActualCost, ModifiedDate' 
set @sep = ',' 

select substring(@col_list, n, charindex(@sep, @col_list + @sep, n) - n) 
from numbers where substring(@sep + @col_list, n, 1) = @sep 
and n < len(@col_list) + 1
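
The query above assumes a table named numbers with an integer column n already exists. A minimal sketch of building one follows; the dbo schema, the 8000-row size (chosen to cover a varchar(8000) input) and the use of sys.all_objects as a row source are assumptions, not part of the original answer.

```sql
-- Assumed name and shape: dbo.numbers(n), matching the query above.
CREATE TABLE dbo.numbers (n int NOT NULL PRIMARY KEY);

-- Fill it with 1..8000. The cross join only serves as a cheap source
-- of enough rows for ROW_NUMBER() to count over.
INSERT INTO dbo.numbers (n)
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;
```

Note that the items in @col_list are separated by a comma plus a space, so the split results keep a leading space unless they are wrapped in LTRIM.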

Your best course of action would be to write the whole thing in SQLCLR.

Source

2014-04-03 13:16:43 dean

How would SQLCLR help?

String manipulation functions in the .NET Framework are much faster than their T-SQL counterparts. – dean

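Of course, the real problem is the design. You shouldn't be storing comma-separated data in a SQL database to begin with. But I guess we're stuck with it for now. One thing to consider is converting the function to SQLCLR; SQL by itself is not very good at string manipulation. (Well, actually, no language is good at string manipulation IMHO, but SQL is really bad at it =)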

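The code that populates @temptable 1 & 2 can be optimized by using the splitter from Jeff Moden, who has written a couple of fantastic articles on the subject, the latest of which can be found here: http://www.sqlservercentral.com/articles/Tally+Table/72993/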

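Using his splitter and optimizing the rest of the code a bit, I managed to get from 771 seconds down to 305 seconds on a random data sample of 200K rows. One thing to note: the results are not exactly the same. I checked a few by hand and I actually think the new results are more accurate, but I did not have time to dig into both versions.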

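I also tried converting this to a more set-based approach, where I first load all the words for all row_ids into a table and then join them back together. Although the join is very fast, creating the initial tables takes so long that it even loses to the original function.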
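
For illustration, a set-based version could look roughly like the sketch below. It is not the code that was timed above: it assumes a hypothetical table dbo.WordPairs(row_id, String1, String2), a comma delimiter, and SQL Server 2016 or later for STRING_SPLIT. It trims items and skips empty ones, as the original function does.

```sql
-- Hypothetical input: dbo.WordPairs(row_id int, String1 varchar(8000), String2 varchar(8000))
-- STRING_SPLIT needs SQL Server 2016+ (database compatibility level 130 or higher).
WITH words1 AS (          -- one row per (row_id, word) from String1
    SELECT p.row_id, LTRIM(RTRIM(s.value)) AS item
    FROM dbo.WordPairs AS p
    CROSS APPLY STRING_SPLIT(p.String1, ',') AS s
    WHERE LTRIM(RTRIM(s.value)) <> ''
),
words2 AS (               -- one row per (row_id, word) from String2
    SELECT p.row_id, LTRIM(RTRIM(s.value)) AS item
    FROM dbo.WordPairs AS p
    CROSS APPLY STRING_SPLIT(p.String2, ',') AS s
    WHERE LTRIM(RTRIM(s.value)) <> ''
),
count1 AS (SELECT row_id, COUNT(*) AS cnt FROM words1 GROUP BY row_id),
count2 AS (SELECT row_id, COUNT(*) AS cnt FROM words2 GROUP BY row_id),
matches AS (              -- distinct words present on both sides of a row
    SELECT w1.row_id, COUNT(DISTINCT w1.item) AS cnt
    FROM words1 AS w1
    JOIN words2 AS w2
      ON w2.row_id = w1.row_id AND w2.item = w1.item
    GROUP BY w1.row_id
)
SELECT p.row_id,
       CASE WHEN count1.cnt IS NULL OR count2.cnt IS NULL THEN 0.0
            ELSE CAST(COALESCE(matches.cnt, 0) * 1.0
                      / CASE WHEN count1.cnt > count2.cnt
                             THEN count1.cnt ELSE count2.cnt END
                      AS decimal(16,2))
       END AS similarity
FROM dbo.WordPairs AS p
LEFT JOIN count1  ON count1.row_id  = p.row_id
LEFT JOIN count2  ON count2.row_id  = p.row_id
LEFT JOIN matches ON matches.row_id = p.row_id;
```

Whether this beats the scalar function depends on how expensive the split step is for the data at hand, which is the cost described above.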

Maybe I'll try to figure out another way to make it faster, but for now I hope this helps you a bit.

ALTER FUNCTION [GetWordSimilarity2](@String1 varchar(8000), 
@String2 varchar(8000),@Delimiter char(1)) 
returns decimal(16,2)   
as   
begin   
declare @temptable1 table (items varchar(8000), row_id int IDENTITY(1, 1), PRIMARY KEY (items, row_id))   
declare @temptable2 table (items varchar(8000), row_id int IDENTITY(1, 1), PRIMARY KEY (items, row_id)) 
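declare @numberOfCommonWords decimal(16,2) 
declare @countTable1 decimal(16,2) 
declare @countTable2 decimal(16,2) 

-- same guard as the original function: NULL or empty input scores 0 
-- (a NULL input would otherwise fail on insert, because items is part of the primary key) 
if @String1 is null or @String2 is null or len(@String1) = 0 or len(@String2) = 0 return 0.0 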

-- based on code from Jeff Moden (http://www.sqlservercentral.com/articles/Tally+Table/72993/) 

--populating @temptable1 
;WITH E1(N) AS (
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 
       ),       --10E+1 or 10 rows 
     E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows 
     E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max 
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front 
        -- for both a performance gain and prevention of accidental "overruns" 
       SELECT TOP (ISNULL(DATALENGTH(@String1),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 
       ), 
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter) 
       SELECT 1 UNION ALL 
       SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@String1,t.N,1) = @Delimiter 
       ), 
cteLen(N1,L1) AS(--==== Return start and length (for use in substring) 
       SELECT s.N1, 
         ISNULL(NULLIF(CHARINDEX(@Delimiter,@String1,s.N1),0)-s.N1,8000) 
        FROM cteStart s 
       ) 
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
INSERT @temptable1 (items) 
SELECT Item  = SUBSTRING(@String1, l.N1, l.L1) 
    FROM cteLen l 

SELECT @countTable1 = @@ROWCOUNT 

----populating @temptable2 
;WITH E1(N) AS (
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 
       ),       --10E+1 or 10 rows 
     E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows 
     E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max 
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front 
        -- for both a performance gain and prevention of accidental "overruns" 
       SELECT TOP (ISNULL(DATALENGTH(@String2),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 
       ), 
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter) 
       SELECT 1 UNION ALL 
       SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@String2,t.N,1) = @Delimiter 
       ), 
cteLen(N1,L1) AS(--==== Return start and length (for use in substring) 
       SELECT s.N1, 
         ISNULL(NULLIF(CHARINDEX(@Delimiter,@String2,s.N1),0)-s.N1,8000) 
        FROM cteStart s 
       ) 
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
INSERT @temptable2 (items) 
SELECT Item  = SUBSTRING(@String2, l.N1, l.L1) 
    FROM cteLen l 

SELECT @countTable2 = @@ROWCOUNT 

--calculating percentage of words match 
if @countTable1 = 0 OR @countTable2 = 0 
    return 0.0 

select @numberOfCommonWords = COUNT(DISTINCT t1.items) 
    from @temptable1 t1 
    JOIN @temptable2 t2 
    ON t1.items = t2.items 


RETURN @numberOfCommonWords/(CASE WHEN (@countTable1 > @countTable2) THEN @countTable1 ELSE @countTable2 END) 

end
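
A short usage sketch for the function above. It assumes the function lives in the dbo schema (scalar UDFs must be called with at least a two-part name); the table dbo.WordPairs and its columns are hypothetical.

```sql
-- Single call: 'green' and 'blue' are shared, each list has 3 items,
-- so the result is 2 / 3, returned as decimal(16,2) = 0.67
SELECT dbo.GetWordSimilarity2('red,green,blue', 'green,blue,black', ',') AS similarity;

-- Per-row call over a hypothetical table of string pairs
SELECT p.row_id,
       dbo.GetWordSimilarity2(p.String1, p.String2, ',') AS similarity
FROM dbo.WordPairs AS p;
```

One difference that is visible in the code: the original function trims each item and skips empty ones, while this version compares items exactly as split, so 'a, b' and 'a,b' share only one item here. That may account for some of the differing results.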

Source

2014-04-03 21:53:25 deroby

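Thanks for the detailed explanation! I am now testing various versions of the function (including SQL CLR). I'll let you know once I'm done. Thanks again :)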
