2008-09-12 105 views
21

如何根據所有候選行的應用權重在T-SQL中隨機選擇一個錶行?T-SQL中的隨機加權選擇

例如,我有一個表格中的一組行,其權重分別爲50,25和25(合計爲100,但不需要),我想隨機選擇其中的一個統計結果相當於相應的重量。

回答

15

丹恩的答案包括一個引入平方律的自我連接。 (n*n/2)行之後的表中有n行。

更理想的是能夠解析表格一次。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END, 
    @weight_point = @weight_point - [table].weight 
FROM 
    @table [table] 
ORDER BY 
    [table].Weight DESC 

這將通過表格,設置@id每個記錄的id值,而在同一時間遞減@weight點。最終,@weight_point將消極。這意味着前面所有權重的SUM大於隨機選擇的目標值。這是我們想要的記錄,因此從那時起,我們將@id設置爲自己(忽略表中的任何ID)。

這隻會貫穿表格一次,但即使所選值是第一條記錄,也必須貫穿整個表格。因爲平均倉位是通過表的一半(和更少,如果下令升重量)編寫循環也可能會被更快......(特別是如果該權重是在公共組):

DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT @next_weight = MAX(weight) FROM @table 
SELECT @row_count = COUNT(*) FROM @table 
SET @weight_point = @weight_point - (@next_weight * @row_count) 

WHILE (@weight_point > 0) 
BEGIN 
    SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight 
    SELECT @row_count = COUNT(*) FROM @table WHERE weight = @next_weight 
    SET @weight_point = @weight_point - (@next_weight * @row_count) 
END 

-- # Once the @weight_point is less than 0, we know that the randomly chosen record 
-- # is in the group of records WHERE [table].weight = @next_weight 

SELECT @row_count = FLOOR(((@row_count - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END, 
    @row_count = @row_count - 1 
FROM 
    @table [table] 
WHERE 
    [table].weight = @next_weight 
ORDER BY 
    [table].Weight DESC 
+0

我做了一些經驗性測試,發現您的解決方案對輸入數據非常敏感。我的測試數據 - 權重:2,998,迭代次數:1M。重量2應該拿起約2k次。如果表中權重的順序是升序(2,998),那麼它的權重2就是大約500次。如果您反轉訂單,則約爲2500次。如果您將`ROUND`更改爲`FLOOR`,升序約爲1000次,重量降低約2000倍。這是適當的概率。我已經更新了你的答案。 – 2016-09-01 19:01:46

7

您只需要對所有準備行的權重求和,然後在該總和中選擇一個隨機點,然後選擇與該選定點協調的記錄(每個記錄遞增地攜帶累加權重和)。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT TOP 1 @id = t1.id 
FROM @table t1, @table t2 
WHERE t1.id >= t2.id 
GROUP BY t1.id 
HAVING SUM(t2.weight) >= @weight_point 
ORDER BY t1.id 

SELECT @id 
+1

我做了一個小您和MatBailie解決方案的基準,看起來您的時間大約是Mat的兩倍。在2行和1千萬次迭代的桌子上,Mat的解決方案耗時約45秒,解決方案約需85秒。 – 2016-09-01 18:40:50

3

「逐步揹着一個accumlating [原文]加權求和」部分是昂貴的,如果你有很多的記錄。如果你已經有了很大的分數/重量範圍(例如:範圍足夠寬,大多數記錄重量是獨一無二的,1-5顆星可能不會削減它),你可以做這樣的事情來選擇一個重量值。我使用VB.Net在這裏展示,但是這可以很容易地在純SQL來完成,以及:

Function PickScore() 
    'Assume we have a database wrapper class instance called SQL and seeded a PRNG already 
    'Get count of scores in database 
    Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]") 
    ' You could also approximate this with just the number of records in the table, which might be faster. 

    'Random number between 0 and 1 with ScoreCount possible values 
    Dim rand As Double = Random.GetNext(ScoreCount)/ScoreCount 

    'Use the equation y = 1 - x^3 to skew results in favor of higher scores 
    ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1 
    rand = 1 - (rand * rand * rand) 

    'Now we need to map the (0,1] vector to [1,Maxscore]. 
    'Just find MaxScore and mutliply by rand 
    Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(Score) FROM Songs") 
    Return MaxScore * rand 
End Function 

運行它,並選擇具有最大記錄得分比返回的重量更輕。如果分數超過一個記錄,請隨機挑選。這裏的優點是你不必保留任何總和,並且你可以調整適合你的口味的概率公式。但是,再次,分數越大,它的效果就越好。

2

使用隨機數生成器完成此操作的方法是集成概率密度函數。使用一組離散值可以計算前綴和(所有值的總和),並將其存儲起來。使用此選項,您可以選擇大於隨機數的最小前綴總和(聚合到日期)值。

在數據庫上插入後的後續值必須更新。如果數據集的更新和大小的相對頻率沒有使得這樣做成本過高,這意味着可以從單個s-argable(可以通過索引查找來解析的謂詞)查詢中獲取適當的值。