優化哈斯克爾文本處理

我寫在Haskell一些簡單的字符計數程序，存儲統計在新的數據類型：優化哈斯克爾文本處理

data Stat = Stat { 
    stChars :: !Int, 
    stVowels :: !Int, 
    stPairEL :: !Int, 
    stWords :: !Int 
}

我跑這了成百上千的純文本文件，每個約50K-100K。

tabulateFile :: FilePath -> IO Stat 
tabulateFile path = do 
    putStrLn path 
    contents <- L.readFile path 
    return $! tabulateText ' ' contents defaultStat

而不是使用摺疊，我使用原始遞歸，所以我可以保留前一個字符。

tabulateText :: Char -> L.ByteString -> Stat -> Stat 
tabulateText lastChr bs stat = 
    case U.uncons bs of 
    Nothing -> stat 
    Just (chr, newBs) -> 
     tabulateText lchr newBs (countChar lastChr lchr stat) 
     where lchr = toLower chr 

{-# INLINE countChar #-} 
countChar :: Char -> Char -> Stat -> Stat 
countChar !lastChr !chr !(Stat stChars stVowels stPairEL stWords) = 
    Stat 
    (stChars + 1) 
    (stVowels + (countIf $ isVowel chr)) 
    (stPairEL + (countIf (lastChr == 'e' && chr == 'l'))) 
    (stWords + (countIf ((not $ isLetter lastChr) && isLetter chr))) 

isVowel :: Char -> Bool 
isVowel c = Set.member c vowels 

vowels = Set.fromAscList ['a', 'e', 'i', 'o', 'u', ...] -- rest of vowels elided

現在，它更是作爲運行cat * | wc的兩倍慢，但我的直覺告訴我，該文件I/O應足以抵消由佳緣需要的CPU時間。使用熱緩存只需使用cat * | wc進程約20MB/s，但使用我的Haskell程序（編譯爲-O）的運行速度低於10MB/s，即使經過一些基本的優化。剖析告訴我大部分時間都花在tabulateText和countChar之間。

有什麼我錯過了，我可以在這裏優化？

編輯：完整的文件粘貼到http://hpaste.org/74638

來源

2012-09-12 erjiang

你可以發佈完整的文件到一些hpaste並在這裏發佈鏈接。它包含許多缺失的函數，我不想實現這些以便能夠運行您的代碼。 – Satvik

是的，對不起，我沒有早點知道。我已將鏈接添加到問題的結尾。 – erjiang

您應提供進口所以有人可以編譯代碼。但是，這裏有幾件事情看起來可能：

編譯-O2 -funbox-strict-fields（得到嚴格的領域的利益）
tabulateText應該在lastChr嚴格，stat
Set.member似乎是一個非常昂貴的方式來進行平等比較。使用跳轉表。

例如，

isSpaceChar8 :: Char -> Bool 
isSpaceChar8 c = 
    c == ' '  || 
    c == '\t' || 
    c == '\n' || 
    c == '\r' || 
    c == '\f' || 
    c == '\v' || 
    c == '\xa0'

這將內聯和優化得很好。

不知道countIf是幹什麼的，但它不好。我懷疑它是一個if，你返回0？如何：

Stat 
    (a + 1) 
    (if isVowel c then a + 1 else a) 
    ...

然後看看核心。

來源

2012-09-12 17:13:52

關於跳轉表，我不清楚什麼會或不會優化到跳轉表。那是關於文件嗎？這將工作多字節字符？我使用Data.Char中的'isSpace'。我認爲這已經優化了？ – erjiang

使'tabulateText'中的'lastChr'嚴格，使運行時翻倍，可能是因爲這是在不需要時不運行'toLower'的淨節約。 – erjiang

{-# LANGUAGE BangPatterns #-} 
import qualified Data.ByteString.Lazy.Char8 as U 
import qualified Data.ByteString.Lazy as L 
import Data.Word 
import Data.Char 
import Control.Applicative 

data Stat = Stat { 
    stChars :: !Int, 
    stVowels :: !Int, 
    stPairEL :: !Int, 
    stWords :: !Int 
} deriving Show 
defaultStat = Stat 0 0 0 0 

{-# INLINE tabulateFile #-} 
tabulateFile :: FilePath -> IO Stat 
tabulateFile path = newTabulate <$> L.readFile path 

{-# INLINE newTabulate #-} 
newTabulate :: L.ByteString -> Stat 
newTabulate = snd . U.foldl' countIt (' ',defaultStat) 
    where 
     {-# INLINE countIt #-} 
     countIt :: (Char,Stat) -> Char -> (Char,Stat) 
     countIt (!lastChr,!Stat stChars stVowels stPairEL stWords) !chr = 
      (chr,Stat 
       (stChars + 1) 
       (if isVowel chr then stVowels + 1 else stVowels) 
       (if (lastChr == 'e' && chr == 'l') then stPairEL + 1 else stPairEL) 
       (if ((isSpace lastChr) && isLetter chr) then stWords+1 else stWords)) 

{-# INLINE isVowel #-} 
isVowel :: Char -> Bool 
isVowel c = 
    c == 'e' || 
    c == 'a' || 
    c == 'i' || 
    c == 'o' || 
    c == 'u' 



main:: IO() 
main = do 
    stat <- tabulateFile "./test.txt" 
    print stat

Don提出的大多數優化都包含在使用高效的foldl'中。性能比cat + wc稍微慢一些，但是可以做更多的計算。我沒有在非常大的文件上測試它，但它應該可以與cat + wc相媲美。

編譯-O2 -funbox-strict-fields以獲得優化的代碼。

在查看內核並查看是否可以進行更多優化後，我會更改它。優化的一個可能的優點是，如果條件超出構造函數的同時構造統計信息，例如，如果chr是元音，那麼它已經是letter，因此如果在stWords等中不需要其他元素，但那樣會真的炸燬你的代碼，但你可以試着看看它是否對大文件有幫助。

來源

2012-09-12 18:28:52 Satvik

經過測試其他替代方案後，似乎高CPU使用率主要是因爲我使用Data.ByteString.Lazy.UTF8。我通過修改tabulateText中的數據結構以使用通過UTF8 ByteString的foldl來削減可忽略的運行時間。

鑑於此，我將程序並行化了文件，並且有時能夠在我的機器上獲得7倍的加速。

我第一包裹tabulateFile與unsafePerformIO：

unsafeTabulateFile :: FilePath -> Stat 
unsafeTabulateFile f = 
    unsafePerformIO $ tabulateFile f

，然後用於Control.Parallel.Strategies做parMap rseq unsafeTabulateFile files。

來源

2012-09-16 19:42:09 erjiang

優化哈斯克爾文本處理

回答

相關問題