GHC中的交叉模塊優化

如果我在同一個模塊中使用Criterion進行測量，那麼我有一個非遞歸函數來計算似乎表現良好的最長公共子序列（ghc 7.6.1，編譯時使用-O2 -fllvm標誌）。另一方面，如果我將函數轉換爲模塊，則只導出該函數（建議使用here），然後再用Criterion進行測量，我會得到〜2x的減速（如果將標準測試移回模塊，則會消失在哪裏定義函數）。我嘗試用INLINE編譯指示標記函數，這對跨模塊性能測量沒有任何影響。GHC中的交叉模塊優化

在我看來，GHC可能會做一個嚴格分析，當函數和主函數（從中可以訪問函數）在同一個模塊中時，它可以很好地工作，但當它們被分割時，不會。我很感激關於如何模塊化函數的指針，以便在從其他模塊調用時可以很好地執行。有問題的代碼太大，無法粘貼 - 如果您想嘗試一下，您可以看到它here。什麼我試圖做一個小例子下面是（與代碼片段）：

-- Function to find longest common subsequence given unboxed vectors a and b 
-- It returns indices of LCS in a and b 
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int) 
lcs a b | (U.length a > U.length b) = lcsh b a True 
     | otherwise = lcsh a b False 

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us 
-- on my test machine 
{-- 
config :: Config 
config = defaultConfig { cfgSamples = ljust 100 } 

a = U.fromList ['a'..'j'] :: Vector Char 
b = U.fromList ['a'..'k'] :: Vector Char 

suite :: [Benchmark] 
suite = [ 
      bench "lcs 10" $ whnf (lcs a) b 
     ] 

main :: IO() 
main = defaultMainWith config (return()) suite 
--}

來源

2013-06-04 Sal

嘗試使用INLINEABLE。它可能會更好。 – Carl

@Carl，嘗試了它的lcs功能。還是一樣。 – Sal

我懷疑問題是，當它全部在一個模塊中時，GHC可以將類型變量'a'專用於'Char'，因爲它從來沒有與任何其他類型一起使用，從而消除類型類的開銷。您可以嘗試使用'SPECIALIZE'編譯指示器（或者只是手動將其更改爲'Char'），看看它是否有效。 – hammar

hammar is right，重要的問題是，編譯器可以看到lcs在使用在同一時間類型因爲它可以看到代碼，因此它可以將代碼專門化爲該特定類型。

如果編譯器不知道代碼將被使用的類型，它不得不僅產生多態代碼。這對性能不利 - 我很驚訝這只是一個大約2倍的差異。多態代碼意味着對於許多操作，需要類型級的查找，並且至少不可能內聯查找函數或恆定倍數大小[例如，對於unboxed數組/矢量訪問]。

無法在單獨的模塊中實現和使用單模塊案例，而無需在使用站點上顯示需要專門化的代碼（或者如果您知道實現站點上需要的類型，，{-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-}等）。

通過標記函數{-# INLINABLE #-}，使代碼在使用地點可見，通常通過暴露界面文件中的展開來完成。

我試着用INLINE編譯標記函數，它在跨模塊性能測量中沒有任何區別。

唯一標識

lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int) 
lcs a b | (U.length a > U.length b) = lcsh b a True 
     | otherwise = lcsh a b False

INLINE或INLINABLE不作當然有差別，該功能是微不足道的，而編譯器公開其反正展開，因爲它是如此之小。即使它的展開沒有暴露，差異也不可測量。

你需要公開的函數的開折做實際工作過，至少是多態的，lcsh的findSnakes，gridWalk和cmp（cmp是一個，這裏是至關重要的，但其他人是必要的，1看到需要cmp，2.從他們那裏調用專門的cmp）。

製作那些INLINABLE，分開的模塊殼體

$ ./diffBench 
warming up 
estimating clock resolution... 
mean is 1.573571 us (320001 iterations) 
found 2846 outliers among 319999 samples (0.9%) 
    2182 (0.7%) high severe 
estimating cost of a clock call... 
mean is 40.54233 ns (12 iterations) 

benchmarking lcs 10 
mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950 
std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950 
variance introduced by outliers: 26.787% 
variance is moderately inflated by outliers

和單模塊殼體

$ ./oneModule 
warming up 
estimating clock resolution... 
mean is 1.726459 us (320001 iterations) 
found 2092 outliers among 319999 samples (0.7%) 
    1608 (0.5%) high severe 
estimating cost of a clock call... 
mean is 39.98567 ns (14 iterations) 

benchmarking lcs 10 
mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950 
std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950 
variance introduced by outliers: 26.791% 
variance is moderately inflated by outliers

之間的差是小bearably。

來源

2013-06-04 08:01:28

好點。試圖分析這一點時，我忘了專業化。 – Sal

GHC中的交叉模塊優化

回答

相關問題