2014-01-15 65 views
5

我正在嘗試使用Haskell查找文件中字符的頻率。我希望能夠處理〜500MB大小的文件。字符的頻率

什麼我試過到目前爲止

  1. 它的工作,但就是有點慢,因爲它解析文件256倍

    calculateFrequency :: L.ByteString -> [(Word8, Int64)] 
    calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0] 
    
  2. 我一直在使用Data.Map也嘗試但程序內存不足(在ghc解釋器中)。

    import qualified Data.ByteString.Lazy as L 
    import qualified Data.Map as M 
    
    calculateFrequency' :: L.ByteString -> [(Word8, Int64)] 
    calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs 
    
+0

如果使用'ghc -O2'編譯會發生什麼?可能避免內存問題的嚴格優化可能只能在那時開始。 –

+0

還在記憶中。 –

+1

如果切換到Data.Map.Strict,那該怎麼辦?http://hackage.haskell.org/package/containers-0.5.0.0/docs/Data-Map-Strict.html –

回答

14

使用可變的,拆箱向量這裏是一個實現而不是更高層次的結構。它還使用conduit來讀取文件以避免惰性I/O。

import   Control.Monad.IO.Class 
import qualified Data.ByteString    as S 
import   Data.Conduit 
import   Data.Conduit.Binary   as CB 
import qualified Data.Conduit.List   as CL 
import qualified Data.Vector.Unboxed.Mutable as VM 
import   Data.Word     (Word8) 

type Freq = VM.IOVector Int 

newFreq :: MonadIO m => m Freq 
newFreq = liftIO $ VM.replicate 256 0 

printFreq :: MonadIO m => Freq -> m() 
printFreq freq = 
    liftIO $ mapM_ go [0..255] 
    where 
    go i = do 
     x <- VM.read freq i 
     putStrLn $ show i ++ ": " ++ show x 

addFreqWord8 :: MonadIO m => Freq -> Word8 -> m() 
addFreqWord8 f w = liftIO $ do 
    let index = fromIntegral w 
    oldCount <- VM.read f index 
    VM.write f index (oldCount + 1) 

addFreqBS :: MonadIO m => Freq -> S.ByteString -> m() 
addFreqBS f bs = 
    loop (S.length bs - 1) 
    where 
    loop (-1) = return() 
    loop i = do 
     addFreqWord8 f (S.index bs i) 
     loop (i - 1) 

-- | The main entry point. 
main :: IO() 
main = do 
    freq <- newFreq 
    runResourceT 
     $ sourceFile "random" 
     $$ CL.mapM_ (addFreqBS freq) 
    printFreq freq 

我跑這對隨機數據的500MB,並@ josejuan的基於UArray回答比較:

  • 管道基礎/可變矢量:1.006s
  • UArray:17。962s

我想應該是可以保持多少的josejuan的高層次的方法風采的又保持可變矢量執行的速度,但我還沒有機會嘗試推行類似的東西還。另外,請注意,對於一些通用幫助器函數(如Data.ByteString.mapM或Data.Conduit.Binary.mapM),實現可能會更簡單,而不會影響性能。

您也可以play with this implementation on FP Haskell Center

編輯:我在conduit中添加了一個缺失的函數並清理了一下代碼;它現在看起來像下面這樣:

import   Control.Monad.Trans.Class (lift) 
import   Data.ByteString    (ByteString) 
import   Data.Conduit    (Consumer, ($$)) 
import qualified Data.Conduit.Binary   as CB 
import qualified Data.Vector.Unboxed   as V 
import qualified Data.Vector.Unboxed.Mutable as VM 
import   System.IO     (stdin) 

freqSink :: Consumer ByteString IO (V.Vector Int) 
freqSink = do 
    freq <- lift $ VM.replicate 256 0 
    CB.mapM_ $ \w -> do 
     let index = fromIntegral w 
     oldCount <- VM.read freq index 
     VM.write freq index (oldCount + 1) 
    lift $ V.freeze freq 

main :: IO() 
main = (CB.sourceHandle stdin $$ freqSink) >>= print 

功能的唯一區別是如何打印頻率。

+0

是否有任何理由使用'MonadIO'類而不是針對'runResourceT'預期的任何類型?它是否有任何性能影響? –

+0

非常好(令人印象深刻)的解決方案! – josejuan

+0

@ChrisTaylor不,你可以專注於'ResourceT IO'。或者,如果你想,你可以完全擺脫'ResourceT'的使用,它只是讓代碼稍微長一些:https://gist.github.com/snoyberg/8436149 –

4

這對我的作品在我的電腦上:

module Main where 
import qualified Data.HashMap.Strict as M 
import qualified Data.ByteString.Lazy as L 
import Data.Word 
import Data.Int 

calculateFrequency :: L.ByteString -> [(Word8, Int64)] 
calculateFrequency xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) M.empty xs 

main = do 
    bs <- L.readFile "E:\\Steam\\SteamApps\\common\\Sid Meier's Civilization V\\Assets\\DLC\\DLC_Deluxe\\Behind the Scenes\\Behind the Scenes.wmv" 
    print (calculateFrequency bs) 

沒有用完的內存,甚至是加載整個文件,但永遠需要(約一分鐘) 600MB +文件!我使用ghc 7.6.3編譯了這個。

我應該指出,除了嚴格的HashMap而不是懶惰Map之外,代碼基本相同。

請注意,在這種情況下insertWithHashMap的兩倍,比Map快。在我的機器,代碼爲54秒寫一次執行,同時使用Map版本需要107

+0

您可以使用'Data.Map.Strict'(原始源代碼所需的唯一更改) – josejuan

+0

^您可以,但不應該。我只是用一些運行時信息更新了我的答案。 –

+0

然後,你不應該使用'HashMap',使用'STArray'('MArray' ... :) :) – josejuan

6

@Alex答案是好的,但,只有256個值(索引)數組應該會更好

import qualified Data.ByteString.Lazy as L 
import qualified Data.Array.Unboxed as A 
import qualified Data.ByteString as B 
import Data.Int 
import Data.Word 

fq :: L.ByteString -> A.UArray Word8 Int64 
fq = A.accumArray (+) 0 (0, 255) . map (\c -> (c, 1)) . concat . map B.unpack . L.toChunks 

main = L.getContents >>= print . fq 

@alex代碼取(用於我的示例文件)24.81 segs,使用數組取7.77 segs。

更新:

雖然Snoyman溶液是較好的,改進避免unpack也許

fq :: L.ByteString -> A.UArray Word8 Int64 
fq = A.accumArray (+) 0 (0, 255) . toCounterC . L.toChunks 
    where toCounterC [] = [] 
      toCounterC (x:xs) = toCounter x (B.length x) xs 
      toCounter _ 0 xs = toCounterC xs 
      toCounter x i xs = (B.index x i', 1): toCounter x i' xs 
           where i' = i - 1 

與〜50%的加速。

更新:

使用IOVector作爲Snoyman是Conduit版本(快一點真的,但是這是一個原始代碼,更好地使用Conduit

import   Data.Int 
import   Data.Word 
import   Control.Monad.IO.Class 
import qualified Data.ByteString.Lazy   as L 
import qualified Data.Array.Unboxed   as A 
import qualified Data.ByteString    as B 
import qualified Data.Vector.Unboxed.Mutable as V 

fq :: L.ByteString -> IO (V.IOVector Int64) 
fq xs = 
    do 
     v <- V.replicate 256 0 :: IO (V.IOVector Int64) 
     g v $ L.toChunks xs 
     return v 
    where g v = toCounterC 
       where toCounterC [] = return() 
         toCounterC (x:xs) = toCounter x (B.length x) xs 
         toCounter _ 0 xs = toCounterC xs 
         toCounter x i xs = do 
              let i' = i - 1 
               w = fromIntegral $ B.index x i' 
              c <- V.read v w 
              V.write v w (c + 1) 
              toCounter x i' xs 

main = do 
      v <- L.getContents >>= fq 
      mapM_ (\i -> V.read v i >>= liftIO . putStr . (++", ") . show) [0..255] 
0

我的兩美分(使用STUArray)。無法在此與其他解決方案進行比較。有人可能願意嘗試...

module Main where 

import Data.Array.ST (runSTUArray, newArray, readArray, writeArray) 
import Data.Array.Unboxed (UArray) 
import qualified Data.ByteString.Lazy as L (ByteString, unpack, getContents) 
import Data.Word 
import Data.Int 
import Control.Monad (forM_) 

calculateFrequency :: L.ByteString -> UArray Word8 Int64 
calculateFrequency bs = runSTUArray $ do 
    a <- newArray (0, 255) 0 
    forM_ (L.unpack bs) $ \i -> readArray a i >>= writeArray a i . succ 
    return a 

main = L.getContents >>= print . calculateFrequency