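Runtime performance degradation for C FFI callback when pthreads are enabled

I am curious about the behavior of the GHC runtime with the -threaded option when a C FFI call calls back into a Haskell function. I wrote code to measure the overhead of a basic function callback (see below). Function callback overhead has been discussed before, but what puzzles me is the sharp increase in total time that I observed when multi-threading is enabled in the C code, even though the total number of calls into Haskell stays the same. In my test, I called the Haskell function f 5M times under two scenarios (GHC 7.0.4, RHEL, 12-core box; runtime options are shown after the code):

- Single thread in the C create_threads function: calls f 5M times. Total time: 1.32s
- 5 threads in the C create_threads function: each thread calls f 1M times, so the total is still 5M. Total time: 7.79s

The code follows. The Haskell code below is set up for the single-threaded C callback; the comments explain how to change it for the 5-thread test: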
t.hs:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import Control.Monad (mapM, mapM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)
f :: CInt -> ()
f x = ()

-- "wrapper" import is a converter for converting a Haskell function to a foreign function pointer
foreign import ccall "wrapper"
  wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
  createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO ()

main = do
  -- set threads=[1..5], l=1000000 for multi-threaded FFI callback testing
  let threads = [1..1]
      l = 5000000
      vl = SV.replicate (length threads) (fromIntegral l) -- make a vector of l
  lf <- mapM (\x -> wrap f) threads -- wrap f into a funPtr and create a list
  let vf = SV.fromList lf -- create vector of FunPtr to f
  -- pass vector of function pointer to f, and vector of l to create_threads
  -- create_threads will spawn threads (equal to length of threads list)
  -- each pthread will call back f l times - then we can check the overhead
  SV.unsafeWith vf $ \x ->
    SV.unsafeWith vl $ \y -> createThreads x y (fromIntegral $ SV.length vl)
  SV.mapM_ freeHaskellFunPtr vf
mt.h:
#include <pthread.h>
#include <stdio.h>
typedef void(*FunctionPtr)(int);
/** Struct for passing argument to thread
**
**/
typedef struct threadArgs{
    int threadId;
    FunctionPtr fn;
    int length;
} threadArgs;
/* This is our thread function. It is like main(), but for a thread*/
void *threadFunc(void *arg);
void create_threads(FunctionPtr*,int*,int);
mt.c:
#include "mt.h"
/* This is our thread function. It is like main(), but for a thread*/
void *threadFunc(void *arg)
{
    FunctionPtr fn;
    threadArgs args = *(threadArgs*) arg;
    int id = args.threadId;
    int length = args.length;
    fn = args.fn;
    int i;
    for (i=0; i < length;){
        fn(i++); //call haskell function
    }
    return NULL; // a void* thread function must return a value
}

void create_threads(FunctionPtr* fp, int* length, int numThreads)
{
    pthread_t pth[numThreads]; // this is our thread identifier
    threadArgs args[numThreads];
    int t;
    for (t=0; t < numThreads;){
        args[t].threadId = t;
        args[t].fn = *(fp + t);
        args[t].length = *(length + t);
        pthread_create(&pth[t],NULL,threadFunc,&args[t]);
        t++;
    }
    for (t=0; t < numThreads;t++){
        pthread_join(pth[t],NULL);
    }
    printf("All threads terminated\n");
}
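To check whether the slowdown comes from creating pthreads rather than from entering the Haskell runtime, a pure-C baseline could be timed next to the test. This is only a sketch under assumptions and was not part of the measurements reported here: baselineArgs, noop_callback, baseline_thread and run_baseline are hypothetical names, and noop_callback is a plain C stand-in for the wrapped Haskell f. An optimizing compiler may inline the empty callback, so the timing is at best a rough lower bound.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

typedef void (*FunctionPtr)(int);

/* Hypothetical per-thread work description, modelled on threadArgs in mt.h. */
typedef struct {
    FunctionPtr fn;   /* callback to invoke */
    int length;       /* number of calls this thread makes */
    long calls;       /* filled in by the thread when it finishes */
} baselineArgs;

/* Plain C stand-in for the Haskell callback f: takes an int, does nothing. */
static void noop_callback(int x) { (void)x; }

static void *baseline_thread(void *arg)
{
    baselineArgs *a = arg;
    long n = 0;
    for (int i = 0; i < a->length; i++) {
        a->fn(i);
        n++;
    }
    a->calls = n;
    return NULL;
}

/* Starts numThreads threads, each calling noop_callback callsPerThread times.
 * Returns elapsed wall-clock seconds (or -1.0 on failure) and stores the
 * number of callbacks actually made in *totalCalls. */
double run_baseline(int numThreads, int callsPerThread, long *totalCalls)
{
    *totalCalls = 0;
    if (numThreads <= 0) return -1.0;

    pthread_t *pth = malloc(sizeof *pth * (size_t)numThreads);
    baselineArgs *args = malloc(sizeof *args * (size_t)numThreads);
    if (pth == NULL || args == NULL) { free(pth); free(args); return -1.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    int started = 0;
    for (int t = 0; t < numThreads; t++) {
        args[t].fn = noop_callback;
        args[t].length = callsPerThread;
        args[t].calls = 0;
        if (pthread_create(&pth[t], NULL, baseline_thread, &args[t]) != 0) break;
        started++;
    }

    long total = 0;
    for (int t = 0; t < started; t++) {
        pthread_join(pth[t], NULL);
        total += args[t].calls;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(pth);
    free(args);
    *totalCalls = total;
    if (started != numThreads) return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling run_baseline(1, 5000000, &total) and run_baseline(5, 1000000, &total) mirrors the two scenarios with no Haskell involved; both should leave total at 5000000. Link with -lpthread.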
Compilation (GHC 7.0.4, gcc 4.4.3 in case it is used by ghc):
$ ghc -O2 t.hs mt.c -lpthread -threaded -rtsopts -optc-O2
Running with 1 thread in create_threads (the code above does this). I turned off parallel GC for this test:
$ ./t +RTS -s -N5 -g1
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.04s ( 1.05s elapsed)
GC time 0.28s ( 0.28s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.32s ( 1.34s elapsed)
%GC time 21.1% (21.2% elapsed)
Running with 5 threads (see the first comment in the main function of t.hs above for how to edit it for 5 threads):
$ ./t +RTS -s -N5 -g1
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.42s ( 2.27s elapsed)
GC time 0.36s ( 0.37s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 7.79s ( 2.63s elapsed)
%GC time 4.7% (13.9% elapsed)
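I would appreciate any insight into why performance degrades with multiple pthreads in create_threads. I first suspected the parallel GC, but I turned it off for the tests above. With the same runtime options, the MUT time also goes up sharply for multiple pthreads, so it is not just GC.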
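To put the totals on a per-callback basis, each one can be divided by the 5M calls. A small helper makes the arithmetic explicit (ns_per_call is a hypothetical name used only for illustration, not part of the test code):

```c
/* Average cost of one callback in nanoseconds, given a total time in
 * seconds and the number of callbacks made. Returns 0.0 if calls <= 0. */
double ns_per_call(double total_seconds, long calls)
{
    if (calls <= 0) return 0.0;
    return total_seconds * 1e9 / (double)calls;
}
```

With the totals reported by +RTS -s, ns_per_call(1.32, 5000000) is about 264 ns per callback for one thread, and ns_per_call(7.79, 5000000) is about 1558 ns of CPU time per callback for five threads. By elapsed time the figures are about 268 ns (1.34s) and 526 ns (2.63s), so the same 5M callbacks take roughly twice as long on the wall clock when spread over five pthreads.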
Also, are there any improvements in GHC 7.4.1 for this kind of scenario?
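I don't plan to call back into Haskell from the FFI that often, but understanding the issue above helps when designing the interaction between Haskell and a multi-threaded C library.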
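Total time of 1.42s (1.42s elapsed) for a single thread and 2.58s (1.86s elapsed) with 4 threads (since I have only 2 physical cores and 4 hardware threads, I thought it pointless to ask for five threads). So it may well be better in 7.4.1. – 2012-01-17 23:02:17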
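@DanielFischer, thanks for the 7.2.2 performance pointer. Maybe I should download and compile the 7.4.1 RC on RHEL to see how it performs, though that is quite a time-consuming job. – Sal 2012-01-17 23:10:48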
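I believe they have precompiled binaries for the release candidates too, so I don't think it would be too time-consuming. Or do the vanilla binaries not work on RHEL? – 2012-01-17 23:14:17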