2011-08-02 31 views
3

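Efficiently randomize (shuffle) data in a Sql Server table

I have a table containing data that I have to randomize. By randomizing, I mean updating a row with data taken from a random other row in the same column. The problem is that the table itself is large (more than 2,000,000 rows).

I wrote a piece of code that uses a while loop, but it is slow.

Does anyone have suggestions for a more efficient way to do the randomization?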

+1

Can you show us your code, please?

+0

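Basically, I select all the IDs from the table into a temp table, then pick one ID from that temp table, look up the value from some random row, and update it. After that I delete that ID from the temp table. I use SELECT TOP 1 MyColumn FROM MyTable WHERE Id >= RAND() * NumberOfRowsInTable – Milhad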

Answers

3

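If you update rows, the update itself will take significant processing time (CPU + I/O).

Have you measured the relative cost of randomizing the rows versus performing the update?

If all you need to do is pick random rows, here is an efficient way to pick a random sample of rows (in this case, 1% of the rows):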

SELECT * FROM myTable 
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float)/CAST (0x7fffffff AS int) 

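where pkID is your primary key column.

A post on this topic may also be of interest. Replacing the 10 values in each row with values taken from other rows will be expensive.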

+0

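I have private user data (first name, last name, VIN, etc.) in that table that I want to shuffle, so that nobody can find the real information and I can use the database for testing. Randomizing whole rows won't help; I need to actually update the rows – Milhad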

2

To shuffle the data in 10 columns, you have to read the 2 million rows 10 times.

The SELECT would be:

SELECT 
    FirstName, LastName, VIN, ... 
FROM 
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName 
    JOIN 
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1 
    JOIN 
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1 
    JOIN 
    ... 

I wouldn't update either; I'd create a new table:

SELECT 
    FirstName, LastName, VIN, ... 
INTO 
    StagingTable 
FROM 
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName 
    JOIN 
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1 
    JOIN 
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1 
    JOIN 
    ... 

Then add the keys and so on, drop the old table, and rename the new one. Or use a SYNONYM that points to the new table.
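The swap described above could look roughly like the following. This is only a sketch: it assumes both tables live in the dbo schema, and MyTable_Old and MyTable_Current are hypothetical names introduced for illustration. The two options are alternatives, so run one or the other, not both.

```sql
-- Option 1: swap the tables by renaming them inside one transaction.
-- (MyTable_Old is a hypothetical name for the retired original.)
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.MyTable', 'MyTable_Old';
    EXEC sp_rename 'dbo.StagingTable', 'MyTable';
COMMIT TRANSACTION;

-- Once the shuffled table has been checked, the original can be dropped:
-- DROP TABLE dbo.MyTable_Old;

-- Option 2 (instead of Option 1): keep both tables and point a synonym
-- at the shuffled copy. (MyTable_Current is a hypothetical name.)
-- CREATE SYNONYM dbo.MyTable_Current FOR dbo.StagingTable;
```

Renaming keeps existing queries working unchanged, while the synonym leaves the original data in place, which may or may not be acceptable when the point is to hide the real values.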

If you do want to update, then I'd do it like this, or break it up into 10 updates.

UPDATE 
    M 
SET 
    Firstname = FirstName.FirstName, 
    LastName = LastName.LastName, 
    ... 
FROM 
    MyTable M 
    JOIN 
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName ON 1=1 
    JOIN 
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1 
    JOIN 
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1 
    JOIN 
    ...

+0

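In SQL Server 2008 (at least), these joins fail with error 1033: "The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified." – T.J. Crowder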

+0

@T.J.Crowder That's a different issue. Why are you using ORDER BY in a view? – gbn

+0

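@gbn: You use it in the joins above ('ORDER BY NEWID()'), presumably to get a random enough order when mixing things up. I tried to execute the update exactly as shown above, changing only the table and column names. In the end I managed it with common table expressions instead. Not efficient, but it got the job done. – T.J. Crowder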
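For reference, a common table expression variant of the kind mentioned in the comment above might look like this. It is a minimal sketch for a single column, not the commenter's actual query: it assumes MyTable has a unique Id column, and the CTE names Original and Shuffled are made up for illustration. Numbering the rows with ROW_NUMBER() avoids both the bare ORDER BY inside a derived table and the ON 1=1 join.

```sql
-- Sketch only: assumes MyTable has a unique Id column.
;WITH Original AS
(
    -- Stable numbering of the rows that will receive new values.
    SELECT Id,
           ROW_NUMBER() OVER (ORDER BY Id) AS rn
    FROM MyTable
),
Shuffled AS
(
    -- The same column values, numbered in random order.
    SELECT FirstName AS NewFirstName,
           ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM MyTable
)
UPDATE M
SET FirstName = S.NewFirstName
FROM MyTable AS M
JOIN Original AS O ON O.Id = M.Id
JOIN Shuffled AS S ON S.rn = O.rn;   -- one-to-one match on row number
```

Repeating the statement once per column gives each column its own independent permutation, at the cost of one extra scan and sort of the table per column.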

2

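Based on Mitch Wheat's answer linking to this article on scrambling data, you can do something like this to scramble a bunch of fields; you are not limited to just the ID: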

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY [UserID]) AS orig_rownum, 
     ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
     * 
FROM [UserTable] 
) 
UPDATE T1 
    SET [UserID] = T2.[UserID] 
     ,[FirstName] = T2.[FirstName] 
     ,[LastName] = T2.[LastName] 
     ,[AddressLine1] = T2.[AddressLine1] 
     ,[AddressLine2] = T2.[AddressLine2] 
     ,[AddressLine3] = T2.[AddressLine3] 
     ,[City] = T2.[City] 
     ,[State] = T2.[State] 
     ,[Pincode] = T2.[Pincode] 
     ,[PhoneNumber] = T2.[PhoneNumber] 
     ,[MobileNumber] = T2.[MobileNumber] 
     ,[Email] = T2.[Email] 
     ,[Status] = T2.[Status] 
FROM Randomize T1 
     join Randomize T2 on T1.orig_rownum = T2.new_rownum 
; 

So you are not limited to doing just this, which is what the article shows:

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS orig_rownum, 
     ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
     * 
FROM [MyTable] 
) 
UPDATE T1 SET Id = T2.Id 
FROM Randomize T1 
     join Randomize T2 on T1.orig_rownum = T2.new_rownum 
; 

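The danger with this approach is the amount of data you are touching. Using a CTE pulls all of it into memory, so although I found this quite fast (19 seconds for a 500k-row table), you will want to be very careful if you have a table with millions of records. You should consider how much data you actually need, or what a good population sample would be, for testing and development.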
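One way to work with a smaller population, sketched under assumptions: copy a random subset into a separate table and scramble only that. MyTable_Sample is a hypothetical name and 10 percent is an arbitrary figure chosen for illustration.

```sql
-- Sketch only: MyTable_Sample is a hypothetical table name,
-- and 10 percent is an arbitrary sample size.
SELECT TOP (10) PERCENT *
INTO MyTable_Sample
FROM MyTable
ORDER BY NEWID();   -- random order, so TOP picks a random subset
```

Note that ORDER BY NEWID() still has to sort the whole source table once, so this reduces the cost of the later scrambling rather than the cost of taking the sample.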

1

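I combined the answers found above into a single query that randomizes each column separately, ending up with completely randomized records: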

UPDATE MyTable SET 
    columnA = columnA.newValue, 
    columnB = columnB.newValue, 
    -- Optionally, for maintaining a group of values like street, zip, city in an address 
    columnC = columnGroup.columnC, 
    columnD = columnGroup.columnD, 
    columnE = columnGroup.columnE 
FROM MyTable 
INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id FROM MyTable 
) AS PKrows ON MyTable.id = PKrows.id 
-- repeat the following JOIN for each column you want to randomize 
INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnA AS newValue FROM MyTable 
) AS columnA ON PKrows.rn = columnA.rn 
INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnB AS newValue FROM MyTable 
) AS columnB ON PKrows.rn = columnB.rn 

-- Optionally, if you want to maintain a group of values spread out over several columns 
INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnC, columnD, columnE FROM MyTable 
) AS columnGroup ON PKrows.rn = columnGroup.rn 

This query took 8 seconds to shuffle 8 columns of a 10K-row table on a Windows 2008 R2 machine with 16 GB of memory and 4 Xeon cores @ 2.93 GHz.