Runtime performance degradation for C FFI callbacks when pthreads are enabled

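I am curious about the behavior of the GHC runtime with the threaded option when C code calls back into a Haskell function through the FFI. I wrote code to measure the overhead of a basic function callback (see below). The overhead of function callbacks has already been discussed before, but what I am curious about is the sharp increase in total time I observed when multi-threading is enabled in the C code, even though the total number of calls into Haskell stays the same. In my test, I called the Haskell function f 5 million times in two scenarios (GHC 7.0.4, RHEL, 12-core box; runtime options are listed after the code below):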

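Single thread in the C create_threads function: f is called 5M times - total time 1.32s
5 threads in the C create_threads function: each thread calls f 1M times, so the total is still 5M - total time 7.79s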

Code below. The Haskell code is set up for the single-threaded C callback test; the comments explain how to update it for the 5-thread test:

t.hs:

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import Control.Monad (mapM, mapM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)

-- Trivial callback: ignores its argument, so only the call overhead is measured
f :: CInt -> ()
f _ = ()

-- "wrapper" import is a converter for converting a Haskell function to a foreign function pointer
foreign import ccall "wrapper"
    wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
    createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO ()

main :: IO ()
main = do
    -- set threads=[1..5], l=1000000 for multi-threaded FFI callback testing
    let threads = [1..1]
        l = 5000000
        vl = SV.replicate (length threads) (fromIntegral l) -- make a vector of l
    lf <- mapM (\_ -> wrap f) threads -- wrap f into a FunPtr and create a list
    let vf = SV.fromList lf -- create vector of FunPtr to f
    -- pass vector of function pointers to f, and vector of l, to create_threads
    -- create_threads will spawn threads (equal to length of threads list)
    -- each pthread will call back f l times - then we can check the overhead
    SV.unsafeWith vf $ \x ->
        SV.unsafeWith vl $ \y -> createThreads x y (fromIntegral $ SV.length vl)
    SV.mapM_ freeHaskellFunPtr vf

mt.h:

#ifndef MT_H
#define MT_H

#include <pthread.h>
#include <stdio.h>

typedef void (*FunctionPtr)(int);

/* Struct for passing arguments to a thread */
typedef struct threadArgs {
    int threadId;
    FunctionPtr fn;
    int length;
} threadArgs;

/* This is our thread function. It is like main(), but for a thread */
void *threadFunc(void *arg);
void create_threads(FunctionPtr *, int *, int);

#endif /* MT_H */

mt.c:

#include "mt.h"

/* This is our thread function. It is like main(), but for a thread */
void *threadFunc(void *arg)
{
    threadArgs args = *(threadArgs *) arg;
    FunctionPtr fn = args.fn;
    int length = args.length;
    int i;
    for (i = 0; i < length; i++) {
        fn(i); /* call Haskell function */
    }
    return NULL; /* a pthread start routine must return a value */
}

void create_threads(FunctionPtr *fp, int *length, int numThreads)
{
    pthread_t pth[numThreads]; /* thread identifiers */
    threadArgs args[numThreads];
    int t;
    for (t = 0; t < numThreads; t++) {
        args[t].threadId = t;
        args[t].fn = fp[t];
        args[t].length = length[t];
        pthread_create(&pth[t], NULL, threadFunc, &args[t]);
    }

    /* wait for every worker before returning to Haskell */
    for (t = 0; t < numThreads; t++) {
        pthread_join(pth[t], NULL);
    }
    printf("All threads terminated\n");
}

Compilation (GHC 7.0.4, and gcc 4.4.3 in case it is used by ghc):

$ ghc -O2 t.hs mt.c -lpthread -threaded -rtsopts -optc-O2

Running with 1 thread in create_threads (the code above does that). I turned off parallel GC for this test:

$ ./t +RTS -s -N5 -g1
    INIT time 0.00s ( 0.00s elapsed)
    MUT time 1.04s ( 1.05s elapsed)
    GC time 0.28s ( 0.28s elapsed)
    EXIT time 0.00s ( 0.00s elapsed)
    Total time 1.32s ( 1.34s elapsed)

    %GC time 21.1% (21.2% elapsed)

Running with 5 threads (see the first comment in the main function of t.hs above for how to edit it for 5 threads):

$ ./t +RTS -s -N5 -g1
    INIT time 0.00s ( 0.00s elapsed)
    MUT time 7.42s ( 2.27s elapsed)
    GC time 0.36s ( 0.37s elapsed)
    EXIT time 0.00s ( 0.00s elapsed)
    Total time 7.79s ( 2.63s elapsed)

    %GC time 4.7% (13.9% elapsed)
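
To put the two reports side by side, the totals can be turned into a per-callback cost. The sketch below only divides the "Total time" figures quoted above by the 5M calls; the helper name nsPerCall is made up for illustration, and the result is a rough average, not a measurement of any single call.

```haskell
import Text.Printf (printf)

-- Total number of callbacks into Haskell in both scenarios (from the post).
calls :: Double
calls = 5000000

-- Hypothetical helper: average nanoseconds per callback for a given total time.
nsPerCall :: Double -> Double
nsPerCall seconds = seconds / calls * 1e9

main :: IO ()
main = do
    -- Figures are the "Total time" lines (CPU, elapsed) from the two runs above.
    printf "1 pthread : %.0f ns CPU, %.0f ns elapsed per call\n"
        (nsPerCall 1.32) (nsPerCall 1.34)
    printf "5 pthreads: %.0f ns CPU, %.0f ns elapsed per call\n"
        (nsPerCall 7.79) (nsPerCall 2.63)
```

On these numbers the CPU cost per callback rises from roughly 264 ns to roughly 1558 ns, while the elapsed cost about doubles, which suggests the extra threads are mostly burning CPU rather than doing callbacks in parallel.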

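I would appreciate any insight into why performance degrades with multiple pthreads in create_threads. I first suspected the parallel GC, but I turned it off for the tests above. With the same runtime options, the MUT time also goes up sharply for multiple pthreads, so it is not just GC.

Also, are there any improvements in GHC 7.4.1 for this kind of scenario?

I don't plan to call back into Haskell from the FFI that often, but understanding the issue above helps when designing the interaction between Haskell and a multi-threaded C library.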

2012-01-17 Sal

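With ghc-7.2.2 I get a total time of 1.42s (1.42s elapsed) for the single-threaded run and 2.58s (1.86s elapsed) with 4 threads (since I only have 2 physical cores and 4 hardware threads, I think it is pointless to ask for five threads). So it might be better in 7.4.1. – 2012-01-17 23:02:17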

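@DanielFischer, thanks for the pointer on 7.2.2 performance. Maybe I should download and compile 7.4.1RC on RHEL to see how it performs, although that is fairly time-consuming. – Sal 2012-01-17 23:10:48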

I believe they have pre-built binaries for release candidates too, so I don't think it would be too time-consuming. Or do the vanilla binaries not work on RHEL? – 2012-01-17 23:14:17

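I believe the key question here is: how does the GHC runtime schedule C callbacks into Haskell? Although I don't know for certain, I suspect that all C callbacks are handled by the Haskell thread that originally made the foreign call, at least up to ghc-7.2.1 (which I'm using).

This would explain the large slowdown you see when moving from 1 thread to 5. If all five threads call back into the same Haskell thread, there will be significant contention on that Haskell thread to complete all the callbacks.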

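To test this, I modified your code so that Haskell forks a new thread before calling create_threads, and create_threads spawns only one thread per call. If I'm correct, each OS thread will then have a dedicated Haskell thread to perform the work, so there should be much less contention. Although this still takes almost twice as long as the single-threaded version, it is significantly faster than the original multi-threaded version, which lends some evidence to this theory. The difference is much smaller if I turn off thread migration with +RTS -qm.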
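
The modified program is not shown in the answer, so the following is only a minimal sketch of what such a change might look like, not the answerer's actual code. It assumes the mt.h/mt.c from the question are compiled in unchanged, uses forkIO with an MVar per thread to wait for completion, and gives the callback the type CInt -> IO () instead of the question's CInt -> (); all of those choices are assumptions.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)
import Foreign.C.Types (CInt)
import Foreign.Marshal.Utils (with)
import Foreign.Ptr (FunPtr, Ptr, freeHaskellFunPtr)

-- Trivial callback (assumption: IO () result rather than the pure () used above).
f :: CInt -> IO ()
f _ = return ()

foreign import ccall "wrapper"
    wrap :: (CInt -> IO ()) -> IO (FunPtr (CInt -> IO ()))

-- Same C entry point as in the question's mt.h.
foreign import ccall safe "mt.h create_threads"
    createThreads :: Ptr (FunPtr (CInt -> IO ())) -> Ptr CInt -> CInt -> IO ()

main :: IO ()
main = do
    let nThreads  = 5 :: Int
        perThread = 1000000 :: CInt   -- 5 x 1M keeps the total at 5M calls
    -- One Haskell thread per create_threads call; each call spawns one pthread.
    dones <- forM [1 .. nThreads] $ \_ -> do
        done <- newEmptyMVar
        _ <- forkIO $ do
            fp <- wrap f
            with fp $ \pf ->
                with perThread $ \pl ->
                    createThreads pf pl 1
            freeHaskellFunPtr fp
            putMVar done ()
        return done
    -- Block until every forked thread has finished its foreign call.
    mapM_ takeMVar dones
```

Built with the same command as in the question (including -threaded), each forked Haskell thread makes its own safe foreign call, so the five pthreads no longer originate from a single call into C. Whether this reproduces the timings reported here depends on the GHC version and RTS options.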

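Since Daniel Fischer reports different results with ghc-7.2.2, I would expect that this version changes how Haskell schedules callbacks. Maybe somebody on the ghc-users list can provide more information; I don't see anything likely in the release notes for 7.2.2 or 7.4.1.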

2012-01-18 12:30:22

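Thanks for your feedback. Your theory looks plausible. There does seem to be some kind of contention going on, and I also suspected that the callbacks were single-threaded. What you describe fits the observations. I also emailed the ghc-users list yesterday. – Sal 2012-01-18 12:45:01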

I verified your observations in my tests. If each pthread is mapped to one Haskell thread for callbacks (in 7.0.4), the runtime scales nicely. Marking your solution as the answer. – Sal 2012-01-20 12:52:39
