我在Haskell中編寫遊戲,而且我在UI中的當前傳遞涉及很多程序生成的幾何。我目前專注於識別一個特定操作(C-ISH僞代碼)的性能:Haskell乘法 - 加法運算的數學表現
Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
vecList[i] = vecList[i] * multiplier + addend;
也就是說,四個浮點沼澤標準的乘加,那種成熟的SIMD優化的事情。
結果將發送到OpenGL頂點緩衝區,因此最終必須將其轉儲到平面C數組中。出於同樣的原因,計算可能應該在C'float'類型上完成。
我已經尋找了一個庫或本地慣用的解決方案,以便在Haskell中快速完成這種事情,但是我提出的每個解決方案都似乎懸停在性能的2%左右(即50倍與GCC的C相比具有正確的標誌更慢)。誠然,我幾個星期前從Haskell開始,所以我的經驗是有限的 - 這就是爲什麼我要來找你們。你們中的任何一個人都可以爲更快的Haskell實現提供建議,或者提供有關如何編寫高性能Haskell代碼的文檔指南?
首先,最近的Haskell解決方案(時鐘約12秒)。我嘗試了this SO post的爆炸模式,但它並沒有改變AFAICT。用'(\ i v - > v * 4)'代替'multAdd'使執行時間縮短到1.9秒,所以按位填充(以及隨之而來的自動優化挑戰)似乎沒有太多錯誤。
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign.C.Types
import Data.Bits
repCount = 10000
arraySize = 20000
a = fromList $ [0.2::CFloat, 0.1, 0.6, 1.0]
m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6]
multAdd :: Int -> CFloat -> CFloat
multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))
multList :: Int -> Vector CFloat -> Vector CFloat
multList !count !src
| count <= 0 = src
| otherwise = multList (count-1) $ V.imap multAdd src
main = do
print $ Data.Vector.Storable.sum $ multList repCount $
Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)
下面是我在C中的代碼。這裏的代碼有幾個#ifdefs,它可以防止它被直接編譯;向下滾動以查看測試驅動程序。
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef float v4fs __attribute__ ((vector_size (16)));
typedef struct { float x, y, z, w; } Vector4;
void setv4(v4fs *v, float x, float y, float z, float w) {
float *a = (float*) v;
a[0] = x;
a[1] = y;
a[2] = z;
a[3] = w;
}
float sumv4(v4fs *v) {
float *a = (float*) v;
return a[0] + a[1] + a[2] + a[3];
}
void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
for (int j = 0; j < N; j++) {
d[j] = s[j] * m + a;
}
}
void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
Vector4 a, Vector4 m) {
for (int j = 0; j < (N*4); j+=4) {
d[j+0] = s[j+0] * m.x + a.x;
d[j+1] = s[j+1] * m.y + a.y;
d[j+2] = s[j+2] * m.z + a.z;
d[j+3] = s[j+3] * m.w + a.w;
}
}
int main() {
v4fs a, m;
v4fs *s, *d;
setv4(&a, 0.2, 0.1, 0.6, 1.0);
setv4(&m, 0.99, 0.7, 0.8, 0.6);
s = calloc(N, sizeof(v4fs));
d = s;
double start = clock();
for (int i = 0; i < M; i++) {
#ifdef COPY
d = malloc(N * sizeof(v4fs));
#endif
#ifdef VECTOR
vecmult(s, d, a, m);
#else
Vector4 aa = *(Vector4*)(&a);
Vector4 mm = *(Vector4*)(&m);
scamult((float*)s, (float*)d, aa, mm);
#endif
#ifdef COPY
free(s);
s = d;
#endif
}
double end = clock();
float sum = 0;
for (int j = 0; j < N; j++) {
sum += sumv4(s+j);
}
printf("%-50s %2.5f %f\n\n", NAME,
(end - start)/(double) CLOCKS_PER_SEC, sum);
}
該腳本將編譯並運行帶有多個gcc標誌組合的測試。 cmath-64-native-O3-restrict-vector-nocopy在我的系統上的性能最好,耗時0.22秒。
import System.Process
import GHC.IOBase
cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")
cOptions = [
[("32", "-m32"), ("64", "-m64")],
[("generic", ""), ("native", "-march=native -msse4")],
[("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
[("restrict", "-DMAYBE_RESTRICT=__restrict__"),
("norestrict", "-DMAYBE_RESTRICT=")],
[("vector", "-DVECTOR"), ("scalar", "")],
[("copy", "-DCOPY"), ("nocopy", "")]
]
-- Fold over the Cartesian product of the double list. Probably a Prelude function
-- or two that does this, but hey. The 'perm' referred to permutations until I realized
-- that this wasn't actually doing permutations. '
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z [] = [z]
permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x
prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
(name ++ "-" ++ namea, cmd ++ " " ++ cmda)
runCCmd name compileCmd = do
res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
if res == ExitSuccess
then do system ("./" ++ name)
return()
else putStrLn $ name ++ " did not compile"
main = do
mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
重寫使用更多的慣用類型大致減半運行時,http://hpaste.org/fastcgi/hpaste.fcgi/view?id=26551#a26551但我轉發給羅馬考慮。 – 2010-06-25 05:26:20