朱莉婭語。如何擊敗矢量化操作？

據說Julia for-loops與矢量化操作一樣快，甚至更快（如果它們使用得當的話）。我有兩段代碼。這個想法是爲給定的0-1序列找到一個樣本統計量，它是x（在這兩個例子中，我試圖找到一個總和，但是有更復雜的例子，我只是想了解一般意義在我的代碼中的性能陷阱）。第一個看起來像：朱莉婭語。如何擊敗矢量化操作？

S = 2 * sum(x) - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new)

，第二個是什麼「天真」的經典：

S = 0 
for i in eachindex(x) 
    S += x[i] 
end 
S = 2 * S - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new)

使用@benchmark我發現的第一例的運行時間爲11.8毫秒，第二個是38毫秒。

這個例子對我來說非常重要，因爲還有很多其他的地方，矢量化是不可能的，所以我想在devectorized「方式」中進行與矢量化一樣快的計算。

是否有任何想法，爲什麼devectorized代碼可能比矢量化速度慢4倍？類型穩定性是OK，沒有不必要的大的存儲器分配等

用於第一函數的代碼是：

function frequency_monobit_test1(x :: Array{Int8, 1}, n = 0) 
# count 1 and 0 in sequence, we want the number 
# of 1's and 0's to be approximately the same 
# reccomendation n >= 100 
# decision Rule(at 1% level): if pval < 0.01 -> non-random 
if (n == 0) 
    n = length(x) 
end 
S = 2 * sum(x) - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new) 
return pval

二是：

function frequency_monobit_test2(x :: Array{Int8, 1}, n = 0) 
# count 1 and 0 in sequence, we want the number 
# of 1's and 0's to be approximately the same 
# reccomendation n >= 100 
# decision Rule(at 1% level): if pval < 0.01 -> non-random 
if (n == 0) 
    n = length(x) 
end 
S = 0 
@inbounds for i in eachindex(x) 
    S += x[i] 
end 
S = 2 * S - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new) 
return pval

來源

2017-06-28 Kirill Tsar.

首先，官方的[性能提示]（https://docs.julialang.org/en/stable/manual/performance-tips/）是讀取 –

另一件事，試圖重要的是把'@inbounds @ simd'前面的'for'語句 –

請向我們展示您運行的實際代碼以獲取基準。 –

這是一個好奇案件。當在Int64變量中累積Int8時，似乎存在性能問題。

讓我們嘗試這些功能：

using SpecialFunctions, BenchmarkTools 

function frequency_monobit_test1(x, n=length(x)) 
    S = sum(x) 
    return erfc(abs(2S - n)/sqrt(2n)) 
end 

function frequency_monobit_test3(typ::Type{<:Integer}, x, n=length(x)) 
    S = zero(typ) 
    for i in eachindex(x) 
     @inbounds S += x[i] 
    end 
    return erfc(abs(2S - n)/sqrt(2n)) 
end

初始化向量一些

N = 2^25; 
x64 = rand(0:1, N); 
x8 = rand(Int8[0, 1], N); 
xB = rand(Bool, N); 
xb = bitrand(N);

標杆：

對於Int64輸入：

julia> @btime frequency_monobit_test1($x64) 
    17.540 ms (0 allocations: 0 bytes) 
0.10302739916042186 

julia> @btime frequency_monobit_test3(Int64, $x64) 
    17.796 ms (0 allocations: 0 bytes) 
0.10302739916042186 

julia> @btime frequency_monobit_test3(Int32, $x64) 
    892.715 ms (67106751 allocations: 1023.97 MiB) 
0.10302739916042186

我們看到，sum和一個明確的循環同樣快，並且用Int32進行初始化是一個壞主意。

對於Int32輸入：

julia> @btime frequency_monobit_test1($x32) 
    9.137 ms (0 allocations: 0 bytes) 
0.2386386867682374 

julia> @btime frequency_monobit_test3(Int64, $x32) 
    8.839 ms (0 allocations: 0 bytes) 
0.2386386867682374 

julia> @btime frequency_monobit_test3(Int32, $x32) 
    7.274 ms (0 allocations: 0 bytes) 
0.2386386867682374

sum和循環是在速度相似。積累到Int32可以節省一點時間。

Int8輸入：

julia> @btime frequency_monobit_test1($x8) 
    5.681 ms (0 allocations: 0 bytes) 
0.16482999123032094 

julia> @btime frequency_monobit_test3(Int64, $x8) 
    19.517 ms (0 allocations: 0 bytes) 
0.16482999123032094 

julia> @btime frequency_monobit_test3(Int32, $x8) 
    4.815 ms (0 allocations: 0 bytes) 
0.16482999123032094

如果稍微快一點積累到Int32時，顯式循環，但聖牛！ Int64怎麼回事？這是超級慢！

Bool怎麼樣？

julia> @btime frequency_monobit_test1($xB) 
    9.627 ms (0 allocations: 0 bytes) 
0.7728544347518309 

julia> @btime frequency_monobit_test3(Int64, $xB) 
    9.629 ms (0 allocations: 0 bytes) 
0.7728544347518309 

julia> @btime frequency_monobit_test3(Int32, $xB) 
    4.815 ms (0 allocations: 0 bytes) 
0.7728544347518309

環路和sum具有相同的速度，但是，積累到Int32節省了一半的時間。

現在，我們將嘗試BitArray：

julia> @btime frequency_monobit_test1($xb) 
    259.044 μs (0 allocations: 0 bytes) 
0.7002576522570715 

julia> @btime frequency_monobit_test3(Int64, $xb) 
    19.423 ms (0 allocations: 0 bytes) 
0.7002576522570715 

julia> @btime frequency_monobit_test3(Int32, $xb) 
    19.430 ms (0 allocations: 0 bytes) 
0.7002576522570715

所以，sum上BitArray是快瘋了，因爲你可以進行分塊的增加，但在一個循環中提取單一稀土元素具有一些開銷。

結論：

你可以得到類似的或更好的性能與環比sum，除了BitArray S的是一個非常特殊的情況。
如果你知道你陣列的長度，並知道Int32就足以保存你的總和，那麼這是一個節省時間。
當積累Int8 s到Int64有一些很奇怪的事情發生。我不知道爲什麼表現如此糟糕。
如果您只對零和1有興趣，請使用Bool s的數組，而不是Int8 s，並且可能積累到Int32。
BitArray在某些情況下s可以超快。
sum對於Int8的載體很奇怪，速度是sum(::Vector{Bool})的兩倍。

來源

2017-07-02 12:46:04 DNF

朱莉婭語。如何擊敗矢量化操作？

回答

相關問題