2014-04-03

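SQL UDF optimization

I have written the following function, which takes two comma-separated strings, splits them into two separate temporary tables, and then uses those tables to find the percentage of words that match between them. The problem is that when I run it against a dataset of about 200k rows, the query times out! Is there any optimization that can be done?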

ALTER FUNCTION [GetWordSimilarity](@String varchar(8000), 
@String2 varchar(8000),@Delimiter char(1)) 
returns decimal(16,2)   
as   
begin   
declare @result as decimal (16,2) 
declare @temptable table (items varchar(8000))   
declare @temptable2 table (items varchar(8000)) 
declare @numberOfCommonWords decimal(16,2) 
declare @countTable1 decimal(16,2) 
declare @countTable2 decimal(16,2) 
declare @denominator decimal(16,2) 
set @result = 0.0 --dummy value 
declare @idx int   
declare @slice varchar(8000)   

select @idx = 1   
    if len(@String)<1 or @String is null or len(@String2) = 0 or @String2 is null return 0.0 

--populating @temptable 
while @idx!= 0   
begin   
    set @idx = charindex(@Delimiter,@String)   
    if @idx!=0   
     set @slice = left(@String,@idx - 1) 
    else   
     set @slice = @String 

    if(len(@slice)>0) 
     insert into @temptable(Items) values(ltrim(rtrim(@slice)))   

    set @String = right(@String,len(@String) - @idx)   
    if len(@String) = 0 break   
end  

select @idx = 1 

----populating @temptable2 
while @idx!= 0   
begin   
    set @idx = charindex(@Delimiter,@String2)   
    if @idx!=0   
     set @slice = left(@String2,@idx - 1) 
    else   
     set @slice = @String2 

    if(len(@slice)>0) 
     insert into @temptable2(Items) values(ltrim(rtrim(@slice)))   

    set @String2 = right(@String2,len(@String2) - @idx)   
    if len(@String2) = 0 break   
end  

--calculating percentage of words match 
if (((select COUNT(*) from @temptable) = 0) or ((select COUNT(*) from @temptable2) = 0)) 
    return 0.0 

select @numberOfCommonWords = COUNT(*) from 
(
    select distinct items from @temptable 
    intersect 
    select distinct items from @temptable2 
) a 

select @countTable1 = COUNT (*) from @temptable 
select @countTable2 = COUNT (*) from @temptable2 

if(@countTable1 > @countTable2) set @denominator = @countTable1 
else set @denominator = @countTable2 

set @result = @numberOfCommonWords/@denominator 

return @result 
end 

Thanks a lot!

+1

You can optimize the split by using a numbers table. There is an example on SQLServerCentral that explains how it works. I'll see if I can find it. – Dreamwalker

+0

Thanks! Please let me know if you can find it. Thanks again! –

+1

I found the link: http://www.sqlservercentral.com/articles/T-SQL/62867/ Essentially, you create a table of numbers and use it to do the work of the loop. The article is far too long to reproduce here. – Dreamwalker

Answers

1

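There is no way to write a T-SQL UDF with heavy string manipulation inside it that will behave well over a large number of rows. You will get some gain if you use a numbers table, though: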

declare 
    @col_list varchar(1000), 
    @sep char(1) 

set @col_list = 'TransactionID, ProductID, ReferenceOrderID, ReferenceOrderLineID, TransactionDate, TransactionType, Quantity, ActualCost, ModifiedDate' 
set @sep = ',' 

select substring(@col_list, n, charindex(@sep, @col_list + @sep, n) - n) 
from numbers where substring(@sep + @col_list, n, 1) = @sep 
and n < len(@col_list) + 1 
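
The query above relies on a table called numbers with an integer column n holding 1, 2, 3, and so on. The answer does not show how that table is built. A minimal sketch, assuming 8000 rows are enough to cover varchar(8000) input, might look like this:

```sql
-- Assumed helper: the table name "numbers" and column "n" are taken from the query above;
-- the 8000-row size is an assumption sized for varchar(8000) input.
CREATE TABLE numbers (n int NOT NULL PRIMARY KEY);

WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
     ),                                             -- 10 rows
     E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b), -- 100 rows
     E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)  -- 10,000 rows
INSERT INTO numbers (n)
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4;
```

The table is populated once and reused, and the primary key on n lets the range filter in the split query be answered with an index seek.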

Your best course of action is to write the whole thing in SQLCLR.

+0

How would SQLCLR help? –

+0

The string manipulation functions in the .NET Framework are much faster than their T-SQL counterparts. – dean

1

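Of course, the real problem is the design. You shouldn't be storing comma-separated data in a SQL database to begin with. But I guess we're stuck with it for now. One thing to consider is converting the function to SQLCLR; SQL by itself is not very good at string operations. (Well, actually, no language is good at string manipulation IMHO, but SQL is really bad at it =)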

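The code that populates @temptable and @temptable2 can be optimized by using the splitter from Jeff Moden, who has written several fantastic articles on the subject, the latest of which can be found here: http://www.sqlservercentral.com/articles/Tally+Table/72993/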

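Using his splitter and optimizing the rest of the code a bit, I managed to get from 771 seconds down to 305 seconds on a random data sample of 200K rows. One thing to note: the results are not exactly the same. I did a few manual checks and I actually think the new results are more accurate, but I didn't have time to dig into both versions.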

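I also tried converting it to a more set-based approach, where I first put all words for all row_ids into a table and then join them back together. Although the join was very fast, creating the initial table took so long that it even lost out to the original function.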
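
The answer does not include the code for that set-based attempt. As a rough sketch of what such an approach could look like, assuming a hypothetical source table dbo.Pairs(row_id, string1, string2) and a numbers table dbo.numbers(n) covering 1 to 8000 (neither name comes from the original post), with items trimmed and empty items skipped as in the question's function:

```sql
-- Assumptions: dbo.Pairs and dbo.numbers are hypothetical names used for illustration only.

-- Step 1: split both strings of every row into one row per word.
SELECT p.row_id, s.side, w.item
INTO #words
FROM dbo.Pairs AS p
CROSS APPLY (SELECT 1, p.string1 UNION ALL SELECT 2, p.string2) AS s(side, str)
JOIN dbo.numbers AS t
  ON t.n <= DATALENGTH(s.str)
 AND (t.n = 1 OR SUBSTRING(s.str, t.n - 1, 1) = ',')   -- n is the start of an element
CROSS APPLY (SELECT LTRIM(RTRIM(SUBSTRING(s.str, t.n,
                 -- ISNULL/NULLIF supplies a length for the last element (no delimiter after it)
                 ISNULL(NULLIF(CHARINDEX(',', s.str, t.n), 0) - t.n, 8000))))) AS w(item)
WHERE w.item <> '';

-- Step 2: per row, distinct common words divided by the larger word count.
WITH c1 AS (SELECT row_id, COUNT(*) AS cnt FROM #words WHERE side = 1 GROUP BY row_id),
     c2 AS (SELECT row_id, COUNT(*) AS cnt FROM #words WHERE side = 2 GROUP BY row_id),
     cm AS (SELECT w1.row_id, COUNT(DISTINCT w1.item) AS common
            FROM #words AS w1
            JOIN #words AS w2
              ON w2.row_id = w1.row_id AND w2.item = w1.item
            WHERE w1.side = 1 AND w2.side = 2
            GROUP BY w1.row_id)
SELECT c1.row_id,
       CAST(ISNULL(cm.common, 0) * 1.0
            / CASE WHEN c1.cnt > c2.cnt THEN c1.cnt ELSE c2.cnt END AS decimal(16,2)) AS similarity
FROM c1
JOIN c2 ON c2.row_id = c1.row_id
LEFT JOIN cm ON cm.row_id = c1.row_id;
```

Rows where either string yields no words do not appear in the result; the scalar function returns 0.0 for those. As noted above, building the word table was the expensive step in the answerer's test, so this shape is not guaranteed to be faster.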

Maybe I'll try to figure out another way to make it faster, but for now I hope this helps you a bit.

ALTER FUNCTION [GetWordSimilarity2](@String1 varchar(8000), 
@String2 varchar(8000),@Delimiter char(1)) 
returns decimal(16,2)   
as   
begin   
declare @temptable1 table (items varchar(8000), row_id int IDENTITY(1, 1), PRIMARY KEY (items, row_id))   
declare @temptable2 table (items varchar(8000), row_id int IDENTITY(1, 1), PRIMARY KEY (items, row_id)) 
declare @numberOfCommonWords decimal(16,2) 
declare @countTable1 decimal(16,2) 
declare @countTable2 decimal(16,2) 

-- based on code from Jeff Moden (http://www.sqlservercentral.com/articles/Tally+Table/72993/) 

--populating @temptable1 
;WITH E1(N) AS (
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 
       ),       --10E+1 or 10 rows 
     E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows 
     E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max 
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front 
        -- for both a performance gain and prevention of accidental "overruns" 
       SELECT TOP (ISNULL(DATALENGTH(@String1),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 
       ), 
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter) 
       SELECT 1 UNION ALL 
       SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@String1,t.N,1) = @Delimiter 
       ), 
cteLen(N1,L1) AS(--==== Return start and length (for use in substring) 
       SELECT s.N1, 
         ISNULL(NULLIF(CHARINDEX(@Delimiter,@String1,s.N1),0)-s.N1,8000) 
        FROM cteStart s 
       ) 
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
INSERT @temptable1 (items) 
SELECT Item  = SUBSTRING(@String1, l.N1, l.L1) 
    FROM cteLen l 

SELECT @countTable1 = @@ROWCOUNT 

----populating @temptable2 
;WITH E1(N) AS (
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 
       ),       --10E+1 or 10 rows 
     E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows 
     E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max 
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front 
        -- for both a performance gain and prevention of accidental "overruns" 
       SELECT TOP (ISNULL(DATALENGTH(@String2),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 
       ), 
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter) 
       SELECT 1 UNION ALL 
       SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@String2,t.N,1) = @Delimiter 
       ), 
cteLen(N1,L1) AS(--==== Return start and length (for use in substring) 
       SELECT s.N1, 
         ISNULL(NULLIF(CHARINDEX(@Delimiter,@String2,s.N1),0)-s.N1,8000) 
        FROM cteStart s 
       ) 
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
INSERT @temptable2 (items) 
SELECT Item  = SUBSTRING(@String2, l.N1, l.L1) 
    FROM cteLen l 

SELECT @countTable2 = @@ROWCOUNT 

--calculating percentage of words match 
if @countTable1 = 0 OR @countTable2 = 0 
    return 0.0 

select @numberOfCommonWords = COUNT(DISTINCT t1.items) 
    from @temptable1 t1 
    JOIN @temptable2 t2 
    ON t1.items = t2.items 


RETURN @numberOfCommonWords/(CASE WHEN (@countTable1 > @countTable2) THEN @countTable1 ELSE @countTable2 END) 

end 
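
One difference between the two versions that is visible from the code: the original function trims each item with ltrim(rtrim(...)) and skips empty slices, while GetWordSimilarity2 inserts the items exactly as they were split. A small check, assuming both functions were created in the dbo schema (the schema is not stated in the post), shows the effect on input with a space after the delimiter:

```sql
-- Assumption: both functions live in the dbo schema.
-- Original: items become 'a','b' and 'b','a' after trimming, so both words match.
SELECT dbo.GetWordSimilarity('a, b', 'b, a', ',');   -- 1.00

-- Rewritten: items are 'a',' b' and 'b',' a'; the leading spaces prevent any match.
SELECT dbo.GetWordSimilarity2('a, b', 'b, a', ',');  -- 0.00
```

Whether trimming is the desired behaviour depends on the data; if it is, wrapping the SUBSTRING in LTRIM(RTRIM(...)) in both INSERT statements would bring the two versions closer together.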
+0

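Thanks for the detailed explanation! I'm now testing various versions of the function (including SQL CLR). I'll let you know once I'm done. Thanks again :) –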