使用Iteratee庫編寫「wc -l」 - 如何篩選換行符？

我試圖拿出使用Haskell Iteratee庫的「wc -l」的等價物。下面是「WC」（這只是計算的話 - 類似於上hackage iteratee例子的代碼）的代碼，運行速度非常快：使用Iteratee庫編寫「wc -l」 - 如何篩選換行符？

{-# LANGUAGE BangPatterns #-} 
import Data.Iteratee as I 
import Data.ListLike as LL 
import Data.Iteratee.IO 
import Data.ByteString 


length1 :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a 
length1 = liftI (step 0) 
    where 
    step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length xs)) 
    step !i stream  = idone i stream 
{-# INLINE length1 #-} 
main = do 
    i' <- enumFile 1024 "/usr/share/dict/words" (length1 :: (Monad m) => Iteratee ByteString m Int) 
    result <- run i' 
    print result 
    {- Time measured on a linux x86 box: 
    $ time ./test ## above haskell compiled code 
    4950996 

    real 0m0.013s 
    user 0m0.004s 
    sys  0m0.007s 

    $ time wc -c /usr/share/dict/words 
    4950996 /usr/share/dict/words 

    real 0m0.003s 
    user 0m0.000s 
    sys  0m0.002s 
    -}

現在，怎麼辦你擴展它來計算太快運行的行數？我做了一個使用Prelude.filter的版本來過濾只有「\ n」的長度，但由於內存太多而導致它比linux「wc -l」慢，而gc（懶惰評估，我猜）。所以，我寫了使用Data.ListLike.filter另一個版本，但它不能編譯，因爲它沒有類型檢查 - 在這裏幫助，將不勝感激：如果你正在閱讀ByteString塊

{-# LANGUAGE BangPatterns #-} 
    import Data.Iteratee as I 
    import Data.ListLike as LL 
    import Data.Iteratee.IO 
    import Data.ByteString 
    import Data.Char 
    import Data.ByteString.Char8 (pack) 

    numlines :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a 
    numlines = liftI $ step 0 
    where 
     step !i (Chunk xs) = liftI (step $i + fromIntegral (LL.length $ LL.filter (\x -> x == Data.ByteString.Char8.pack "\n") xs)) 
     step !i stream = idone i stream 
    {-# INLINE numlines #-} 

    main = do 
    i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int) 
    result <- run i' 
    print result

來源

2011-11-02 Sal

已經有很多很好的答案了;我很少提供性能方面的，但有幾點風格。

首先，我會寫這樣說：

import Prelude as P 
import Data.Iteratee 
import qualified Data.Iteratee as I 
import qualified Data.Iteratee.IO as I 
import qualified Data.ByteString as B 
import Data.Char 
import System.Environment 

-- numLines has a concrete stream type so it's not necessary to provide an 
-- annotation later. It could have a more general type. 
numLines :: Monad m => I.Iteratee B.ByteString m Int 
numLines = I.foldl' step 0 
where 
    --step :: Int -> Word8 -> Int 
    step acc el = if el == (fromIntegral $ ord '\n') then acc + 1 else acc 

main = do 
    f:_ <- getArgs 
    words <- run =<< I.enumFile 65536 f numLines 
    print words

最大的不同是，這裏使用Data.Iteratee.ListLike.foldl'。請注意，只有各個流元素對step功能有影響，而不是流類型。它的功能與您使用的功能完全相同。 Data.ByteString.Lazy.foldl'。

使用foldl'也意味着您不需要手動編寫與liftI迭代。除非絕對必要，否則我會阻止用戶這樣做。結果通常更長，更難維護，幾乎沒有任何好處。

最後，我已經顯着增加了緩衝區大小。在我的系統上，這個速度比enumerator的默認值4096稍微快一些，這比默認的1024更快一些（迭代）。YMMV當然是這樣設置的。

來源

2011-11-03 19:17:40

謝謝你，約翰。非常有用的反饋。我的目的是瞭解如何使用基本構建塊來編寫它們，以便我能夠理解迭代器。您的反饋有助於瞭解如何編寫超出玩具代碼的代碼。 – Sal

，您可以使用從Data.ByteString的count功能，相關的步驟將被

step !i (Chunk xs) = liftI (step $ i + count 10 xs)

（或許還有一個fromIntegral）。 Data.ByteString.count非常快，不應該比wc -l慢得多。

來源

2011-11-02 20:01:03

所以我做了一些試驗，我得到了一個wc -l，它的速度只有「wc -l」的兩倍，這比上面顯示的wc -c版本更好。

{-# LANGUAGE OverloadedStrings #-} 

import qualified Data.ByteString.Lazy.Char8 as BSL 
import qualified Data.ByteString.Char8 as BS 
import qualified Data.Enumerator as E 
import qualified Data.Enumerator.Binary as EB 
import Control.Monad.IO.Class (liftIO) 
import Data.Int 

numlines :: Int64 -> E.Iteratee BS.ByteString IO() 
numlines n = do 
    chunk <- EB.take 1024 
    case chunk of 
     "" -> do liftIO $ print n 
       return() 
     a -> do let ct = BSL.count '\n' a 
       numlines (n+ct) 

main = do 
    let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines 0 
    E.run_ i

運行它與本土：

Eriks-MacBook-Air:skunk erikhinton$ time wc -l "/usr/share/dict/words" 
    235886 /usr/share/dict/words 

real 0m0.009s 
user 0m0.006s 
sys 0m0.002s 
Eriks-MacBook-Air:skunk erikhinton$ time ./wcl 
235886 

real 0m0.019s 
user 0m0.013s 
sys 0m0.005s

[編輯]

這裏有一個更快，更小的佔地面積和做的更加簡潔/表達方式。這些統計員開始變得有趣。

{-# LANGUAGE OverloadedStrings #-} 

import qualified Data.ByteString.Lazy.Char8 as BSL 
import qualified Data.ByteString.Char8 as BS 
import qualified Data.Enumerator as E 
import qualified Data.Enumerator.Binary as EB 
import qualified Data.Enumerator.List as EL 
import Control.Monad.IO.Class (liftIO) 
import Data.Int 

numlines :: E.Iteratee BS.ByteString IO() 
numlines = do 
      num <- EL.fold (\n b -> (BS.count '\n' b) + n) 0 
      liftIO . print $ num 

main = do 
    let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines 
    E.run_ i

而且

Eriks-MacBook-Air:skunk erikhinton$ time ./wcl2 
235886 

real 0m0.015s 
user 0m0.010s 
sys 0m0.004s

來源

2011-11-03 03:26:37

謝謝，Erik。我想出瞭如何解決類型錯誤，並且使用iteratee庫獲得了類似於haskell vs native的相對性能。剛剛發佈了固定代碼。 – Sal

很高興聽到它。剛剛發佈了比我之前版本更快，更小的版本。相當滿意這一個的簡潔。希望Haskell嚮導能夠將其翻轉過來。 –

哇，它有很好的表現，在Linux x86_64的：'$時間./wcl 真正0m0.039s 用戶0m0.024s SYS 0m0.008s $時間WC -l在/ usr /共享/字典/單詞 479623在/ usr /共享/字典/單詞真正0m0.037s 用戶0m0.012s SYS 0m0.009s – Sal

我想通了，如何解決此類錯誤的時機。到定影型錯誤關鍵是理解的關係Data.ListLike.filter和之間字節串正被傳遞給濾波器輸入。這裏是Data.ListLike.filter類型：

Data.ListLike.filter 
:: Data.ListLike.Base.ListLike full item => 
(item -> Bool) -> full -> full

全指流中的枚舉的情況下/ iteratee，如果我理解正確。 item引用流的元素。

現在，如果我們想要在輸入文件中對新行進行過濾，我們必須知道輸入文件流的類型以及該流中元素的類型。在這種情況下，輸入文件正在被讀取爲ByteString流。ByteString被記錄爲Word8矢量的空間高效表示。所以，項目這裏輸入的是Word8。因此，當我們編寫過濾器時，在step函數中，我們必須確保Bool操作是爲Word8定義的，因爲這是傳遞給過濾器的項目類型（如上所述）。我們正在篩選換行符。因此，像下面這些構建新行的Word8表示，並檢查是否有針對型Word8的X平等的一個布爾函數，應該工作：

\x -> x == Data.ByteString.Internal.c2w '\n'

還有一個更缺少的部分 - 對某些原因，編譯器（v7.0.3 Mac）無法推斷numfile文件類型簽名中的el的類型（如果有人對此有何看法，請做討論）。所以，明確地告訴它它是Word8解決了編譯問題：

numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a

下面的完整代碼 - 它編譯，運行速度非常快。

{-# LANGUAGE BangPatterns,FlexibleContexts #-} 
import Data.Iteratee as I 
import Data.ListLike as LL 
import Data.Iteratee.IO 
import Data.ByteString 
import GHC.Word (Word8) 
import Data.ByteString.Internal (c2w) 

numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a 
numlines = liftI $ step 0 
    where 
    step !i (Chunk xs) = let newline = c2w '\n' in liftI (step $i + fromIntegral (LL.length $ LL.filter (\x -> x == newline) xs)) 
    step !i stream = idone i stream 
{-# INLINE numlines #-} 


main = do 
    i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int) 
    result <- run i' 
    print result 
{- Time to run on mac OSX: 

$ time ./test ## above compiled program: ghc --make -O2 test.hs 
235886 

real 0m0.011s 
user 0m0.007s 
sys 0m0.004s 
$ time wc -l /usr/share/dict/words 
235886 /usr/share/dict/words 

real 0m0.005s 
user 0m0.002s 
sys 0m0.002s 
-}

來源

2011-11-03 11:03:39 Sal

使用Iteratee庫編寫「wc -l」 - 如何篩選換行符？

回答

相關問題