Haskell: a more efficient way to parse a file of lines of digits

I have a file of about 8 MB in which each line holds 6 integers separated by a space.

My current method for parsing it is:

tuplify6 :: [a] -> (a, a, a, a, a, a) 
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q) 

toInts :: String -> (Int, Int, Int, Int, Int, Int) 
toInts line = 
     tuplify6 $ map read stringNumbers 
     where stringNumbers = split " " line

and mapping toInts over

liftM lines . readFile

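which gives me a list of tuples. However, when I run this, it takes nearly 25 seconds to load the file and parse it. Is there any way to speed this up? The file is just plain text.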

2012-07-03 DantheMan

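Could you provide more information: the whole working program, the input, how you run it, and how you compile it (with optimizations?) or whether you run it in 'ghci'? Do you know 'Data.ByteString' and 'Data.Vector'? Also, 'read' is slow, at least that is what I have heard. – epsilonhalbe

See also http://stackoverflow.com/questions/8366093/how-do-i-parse-a-matrix-of-integers-in-haskell/8366642

You can use ByteStrings to speed this up, e.g.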

module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Char (isDigit)

main :: IO ()
main = do
    args <- getArgs
    mapM_ doFile args

-- Read one file lazily and print how many 6-tuples it contains.
doFile :: FilePath -> IO ()
doFile file = do
    bs <- C.readFile file
    let tups = buildTups 0 [] $ C.dropWhile (not . isDigit) bs
    print (length tups)

-- k counts the Ints read so far for the current tuple; acc holds them in
-- reverse order, hence the reverse before building the tuple.
-- Skipping non-digits also drops any '-' sign, so this assumes
-- non-negative input.
buildTups :: Int -> [Int] -> C.ByteString -> [(Int,Int,Int,Int,Int,Int)]
buildTups 6 acc bs = tuplify6 (reverse acc) : buildTups 0 [] bs
buildTups k acc bs
    | C.null bs = if k == 0 then [] else error ("Bad file format " ++ show k)
    | otherwise = case C.readInt bs of
        Just (i,rm) -> buildTups (k+1) (i:acc) $ C.dropWhile (not . isDigit) rm
        Nothing -> error ("No Int found: " ++ show (C.take 100 bs))

tuplify6 :: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

That runs pretty fast:

$ time ./fileParse IntList 
200000 

real 0m0.119s 
user 0m0.115s 
sys  0m0.003s

for an 8.1 MiB file.
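A line-oriented variant of the same ByteString idea may be easier to adapt when each tuple sits on its own line. The following is only a sketch, not the code that was timed above: the names `readWholeInt` and `parseLine` are hypothetical, and it assumes that malformed lines should be skipped rather than reported.

```haskell
module Main (main) where

import Control.Monad (forM_)
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Maybe (mapMaybe)
import System.Environment (getArgs)

-- Accept a word only if C.readInt consumes all of it (hypothetical helper).
readWholeInt :: C.ByteString -> Maybe Int
readWholeInt w = case C.readInt w of
    Just (i, rest) | C.null rest -> Just i
    _                            -> Nothing

-- Parse one line of exactly six whitespace-separated integers.
parseLine :: C.ByteString -> Maybe (Int, Int, Int, Int, Int, Int)
parseLine line = case traverse readWholeInt (C.words line) of
    Just [l, m, n, o, p, q] -> Just (l, m, n, o, p, q)
    _                       -> Nothing

main :: IO ()
main = do
    args <- getArgs
    forM_ args $ \file -> do
        bs <- C.readFile file
        -- Assumption: lines that do not parse are silently dropped.
        let tups = mapMaybe parseLine (C.lines bs)
        print (length tups)
```

Because `C.readInt` is applied to each word directly, a leading minus sign is preserved here, and a line with the wrong number of fields yields `Nothing` instead of a runtime error.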

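~~On the other hand, using Strings and your conversion (with a couple of seqs to force evaluation) also took only 0.66s, so most of the time seems to be spent not on parsing but on working with the result.~~

Oops, I missed a seq, so the reads were not actually evaluated in the String version. With that fixed, String + read takes about four seconds, and a little over one second with the custom Int parser from @Rotsor's comment,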

foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0

so parsing evidently did take a significant amount of the time.
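For reference, a String-based program built around that fold might look like the sketch below. It is not the program that was timed. The names `parseDigits` and `sumTuple` are hypothetical, and the fold assumes every field is a non-empty run of ASCII digits (no sign, no validation). Summing the components is just one way to make sure every parsed Int is actually demanded.

```haskell
module Main (main) where

import Data.List (foldl')
import Data.Maybe (mapMaybe)
import System.Environment (getArgs)

-- Hand-rolled decimal parser using the fold from the comment.
-- Assumption: input consists of ASCII digits only; anything else
-- produces a meaningless number instead of an error.
parseDigits :: String -> Int
parseDigits = foldl' (\a c -> 10 * a + fromEnum c - fromEnum '0') 0

-- Parse one line of six space-separated integers.
toInts :: String -> Maybe (Int, Int, Int, Int, Int, Int)
toInts line = case map parseDigits (words line) of
    [l, m, n, o, p, q] -> Just (l, m, n, o, p, q)
    _                  -> Nothing

-- Adding up the components forces each of them.
sumTuple :: (Int, Int, Int, Int, Int, Int) -> Int
sumTuple (l, m, n, o, p, q) = l + m + n + o + p + q

main :: IO ()
main = do
    args <- getArgs
    mapM_ run args
  where
    run file = do
        contents <- readFile file
        let tups = mapMaybe toInts (lines contents)
        -- The strict fold demands every component, so the parsing cost
        -- is included in the measured run time.
        print (foldl' (\acc t -> acc + sumTuple t) 0 tups)
```

Printing only `length tups` here would leave the tuple components as unevaluated thunks, which is the kind of measurement mistake corrected above.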

2012-07-03 22:07:25

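Thanks. I forgot about Haskell's lazy evaluation, so I was wrong about where the time was going. But thanks for the alternative method as well! – DantheMan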
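The laziness pitfall mentioned here can be reproduced in a few lines. This is a minimal sketch with made-up sample data (`fields` and `parsed` are hypothetical names): taking the length of a list of `read` results never runs `read`, whereas summing the list does.

```haskell
module Main (main) where

import Control.Exception (SomeException, evaluate, try)

-- Hypothetical sample data: the last field is deliberately not a number.
fields :: [String]
fields = ["1", "2", "oops"]

parsed :: [Int]
parsed = map read fields

main :: IO ()
main = do
    -- length walks only the list spine, so no call to read is evaluated
    -- and the bad field goes unnoticed.
    print (length parsed)
    -- sum demands every element, so read finally runs and fails on "oops".
    r <- try (evaluate (sum parsed)) :: IO (Either SomeException Int)
    putStrLn (either (const "read failed once the elements were forced") show r)
```

A benchmark that only inspects the length of the result list therefore measures list construction, not parsing.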

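Can you show the whole program that finishes in 0.66s using 'read'? I [asked a similar question](http://stackoverflow.com/questions/7510078/why-is-char-based-input-so-much-slower-than-the-char-based-output-in-haskell) before, and the answer was "read is slow". Here, merely replacing 'read' with 'foldl (\a c -> a*10 + fromEnum c - fromEnum '0') 0' makes it 6 times faster, which suggests that most of the time is spent parsing. How did you manage to improve on that? – Rotsor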
