2016-02-04 80 views
3

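Compute the MD5 hash of a UTF-8 string

I have a SQL table that stores large string values, which must be unique. To enforce uniqueness, I have a unique index on a column that stores the string representation of the MD5 hash of the large string.

The C# application that saves these records uses the following method to do the hashing: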

public static string CreateMd5HashString(byte[] input) 
{ 
    var hashBytes = MD5.Create().ComputeHash(input); 
    return string.Join("", hashBytes.Select(b => b.ToString("X"))); 
} 

To call it, I first convert the string to a byte[] using UTF-8 encoding:

// this is what I use in my app 
CreateMd5HashString(Encoding.UTF8.GetBytes("abc")) 
// result: 90150983CD24FB0D6963F7D28E17F72 

Now I want to implement this hash function in SQL, using the HASHBYTES function, but I get a different value:

print hashbytes('md5', N'abc') 
-- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315 

This is because SQL computes the MD5 of the string's UTF-16 representation. I get the same result in C# if I call CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).
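
To make the encoding difference concrete, here is a minimal sketch that hashes the same string under both encodings. The class and method names (EncodingHashDemo, Md5Hex) are made up for illustration. Note that this helper formats each byte with "X2" (two digits per byte), whereas the method in the question uses "X", which drops leading zeros; that is why the result posted above has 31 characters instead of 32.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class EncodingHashDemo
{
    // Hypothetical helper: MD5 digest as upper-case hex, two digits per byte.
    public static string Md5Hex(byte[] input)
    {
        using (var md5 = MD5.Create())
        {
            return string.Concat(md5.ComputeHash(input).Select(b => b.ToString("X2")));
        }
    }

    public static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("abc");     // one byte per ASCII character
        byte[] utf16 = Encoding.Unicode.GetBytes("abc"); // UTF-16LE, two bytes per character

        Console.WriteLine(BitConverter.ToString(utf8));  // 61-62-63
        Console.WriteLine(BitConverter.ToString(utf16)); // 61-00-62-00-63-00

        // Different input bytes, therefore different digests.
        Console.WriteLine(Md5Hex(utf8));  // 900150983CD24FB0D6963F7D28E17F72
        Console.WriteLine(Md5Hex(utf16)); // the value HASHBYTES returns for N'abc'
    }
}
```

If the SQL side has to reproduce the values the application already stored, it would presumably also need to reproduce the unpadded "X" formatting, not just the UTF-8 bytes.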

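I cannot change the way the hashing is done in the application.

Is there a way to make SQL Server compute the MD5 hash of the string's UTF-8 bytes?

I looked at similar questions and tried using collations, but have had no luck so far.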

+0

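I was doing kind of the same thing just last night.. I guess you use it to store passwords and check logins... why don't you change your logic so that C# computes the MD5 hash again and then checks whether it matches the string stored in the database? – Veljko89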

+1

@Veljko89 MD5 is [not suitable](http://security.stackexchange.com/a/19908/4304) for passwords. I suggest you avoid it. – GolfWolf

+0

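But to actually test it against any website there are defenses, a timeout after 5 attempts, and so on... no website can handle that volume of logins. And even to find one person's password, what chance does anyone have of figuring out a 20-character string added as a salt? – Veljko89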

Answers

5

You need to create a UDF that converts the NVARCHAR data to the bytes of its UTF-8 representation. Say it is called dbo.NCharToUTF8Binary; then you can do:

hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 1)) 

Here is a UDF that will do that:

create function dbo.NCharToUTF8Binary(@txt NVARCHAR(max), @modified bit) 
returns varbinary(max) 
as 
begin 
-- Note: This is not the fastest possible routine. 
-- If you want a fast routine, use SQLCLR 
    set @modified = isnull(@modified, 0) 
    -- First shred into a table. 
    declare @chars table (
    ix int identity primary key, 
    codepoint int, 
    utf8 varbinary(6) 
    ) 
    declare @ix int 
    set @ix = 0 
    while @ix < datalength(@txt)/2 -- trailing spaces 
    begin 
     set @ix = @ix + 1 
     insert @chars(codepoint) 
     select unicode(substring(@txt, @ix, 1)) 
    end 

    -- Now look for surrogate pairs. 
    -- If we find a pair (lead followed by trail) we will pair them 
    -- High surrogate is \uD800 to \uDBFF 
    -- Low surrogate is \uDC00 to \uDFFF 
    -- Look for high surrogate followed by low surrogate and update the codepoint 
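    -- Each surrogate half carries 10 payload bits, so mask with 0x03ff and shift the lead half by 0x0400
    update c1 set codepoint = ((c1.codepoint & 0x03ff) * 0x0400) + (c2.codepoint & 0x03ff) + 0x10000 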
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1 
    where c1.codepoint >= 0xD800 and c1.codepoint <=0xDBFF 
    and c2.codepoint >= 0xDC00 and c2.codepoint <=0xDFFF 
    -- Get rid of the trailing half of the pair where found 
    delete c2 
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1 
    where c1.codepoint >= 0x10000 

    -- Now we utf-8 encode each codepoint. 
    -- Lone surrogate halves will still be here 
    -- so they will be encoded as if they were not surrogate pairs. 
    update c 
    set utf8 = 
    case 
    -- One-byte encodings (modified UTF8 outputs zero as a two-byte encoding) 
    when codepoint <= 0x7f and (@modified = 0 OR codepoint <> 0) 
    then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6)) 
    -- Two-byte encodings 
    when codepoint <= 0x07ff 
    then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)),4,1) 
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1) 
    -- Three-byte encodings 
    when codepoint <= 0x0ffff 
    then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)),4,1) 
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1) 
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1) 
    -- Four-byte encodings 
    when codepoint <= 0x1FFFFF 
    then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)),4,1) 
    + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)),4,1) 
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1) 
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1) 

    end 
    from @chars c 

    -- Finally concatenate them all and return. 
    declare @ret varbinary(max) 
    set @ret = cast('' as varbinary(max)) 
    select @ret = @ret + utf8 from @chars c order by ix 
    return @ret 

end 
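
For cross-checking the UDF's output outside the database, the same steps (combine surrogate pairs, then encode each code point in one to four bytes) can be sketched in C#. This is only an illustrative reference under stated assumptions: the names Utf8Reference, EncodeCodePoint and Encode are hypothetical, and the surrogate pairs are combined with the standard UTF-16 formula.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class Utf8Reference
{
    // Encodes one code point with the same branch structure as the UDF's CASE expression.
    // modified = true emits U+0000 as the two-byte sequence C0 80 ("modified UTF-8").
    public static byte[] EncodeCodePoint(int cp, bool modified)
    {
        if (cp <= 0x7F && (!modified || cp != 0))
            return new[] { (byte)cp };
        if (cp <= 0x7FF)
            return new[] { (byte)(0xC0 | (cp >> 6)), (byte)(0x80 | (cp & 0x3F)) };
        if (cp <= 0xFFFF)
            return new[] { (byte)(0xE0 | (cp >> 12)), (byte)(0x80 | ((cp >> 6) & 0x3F)),
                           (byte)(0x80 | (cp & 0x3F)) };
        return new[] { (byte)(0xF0 | (cp >> 18)), (byte)(0x80 | ((cp >> 12) & 0x3F)),
                       (byte)(0x80 | ((cp >> 6) & 0x3F)), (byte)(0x80 | (cp & 0x3F)) };
    }

    // Walks the UTF-16 code units, joining a lead surrogate with the trail surrogate that follows it.
    public static byte[] Encode(string s, bool modified)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < s.Length; i++)
        {
            int cp = s[i];
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                // 10 payload bits from each half, plus the 0x10000 offset.
                cp = ((s[i] & 0x3FF) << 10) + (s[i + 1] & 0x3FF) + 0x10000;
                i++;
            }
            bytes.AddRange(EncodeCodePoint(cp, modified));
        }
        return bytes.ToArray();
    }

    public static void Main()
    {
        // 'a', U+00E9, U+20AC, U+1F600 (the last one is a surrogate pair in UTF-16).
        string sample = "a\u00E9\u20AC\uD83D\uDE00";
        Console.WriteLine(BitConverter.ToString(Encode(sample, false))); // 61-C3-A9-E2-82-AC-F0-9F-98-80
        Console.WriteLine(Encode(sample, false).SequenceEqual(Encoding.UTF8.GetBytes(sample))); // True
    }
}
```

One difference to keep in mind: like the UDF, this sketch encodes a lone surrogate half as an ordinary three-byte sequence, whereas Encoding.UTF8.GetBytes replaces it with U+FFFD by default, so the two only agree on well-formed strings.
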
1

SQL Server does not natively support UTF-8 strings, and it hasn't for quite a while. As you noticed, NCHAR and NVARCHAR use UCS-2 rather than UTF-8.

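If you insist on using the HASHBYTES function, you must be able to pass it the UTF-8 byte[] from your C# code as VARBINARY in order to preserve the encoding. HASHBYTES accepts VARBINARY in place of NVARCHAR. This can be done with a CLR function that accepts NVARCHAR and returns the result of Encoding.UTF8.GetBytes as VARBINARY.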
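
A minimal sketch of such a CLR function might look like the following. The class and method names (Utf8Functions, NCharToUtf8) are hypothetical, and the sketch assumes the assembly is built against the .NET Framework's System.Data and registered in the database with CREATE ASSEMBLY and a CREATE FUNCTION ... EXTERNAL NAME wrapper whose T-SQL signature declares nvarchar(max) in and varbinary(max) out.

```csharp
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;

public static class Utf8Functions
{
    // Hypothetical SQLCLR scalar function: returns the UTF-8 bytes of the input string.
    // Marked deterministic because the same input always yields the same bytes.
    [SqlFunction(IsDeterministic = true, IsPrecise = true, DataAccess = DataAccessKind.None)]
    public static SqlBytes NCharToUtf8(SqlString input)
    {
        // Propagate SQL NULL instead of throwing on input.Value.
        if (input.IsNull)
            return SqlBytes.Null;

        return new SqlBytes(Encoding.UTF8.GetBytes(input.Value));
    }
}
```

Once registered under a name such as dbo.NCharToUtf8 (again, an assumed name), the call would be hashbytes('md5', dbo.NCharToUtf8(N'abc')), which hashes the same three bytes the application hashes.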

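That said, I strongly recommend keeping these kinds of business rules isolated in your application rather than in the database, especially since the application already performs this logic.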