Debugging HXT performance problems

I am trying to use HXT to read some large XML data files (hundreds of MB).

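There is a space leak somewhere in my code, but I can't seem to find it. Since my knowledge of the GHC profiling toolchain is very limited, I only have a slight clue as to what is happening.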

Basically, the document is parsed, but not evaluated.

Here is some code:

{-# LANGUAGE Arrows, NoMonomorphismRestriction #-} 

import Text.XML.HXT.Core 
import System.Environment (getArgs) 
import Control.Monad (liftM) 

main = do file <- (liftM head getArgs) >>= parseTuba
          case file of
            (Left m)  -> print "Failed."
            (Right _) -> print "Success."

data Sentence t = Sentence [Node t] deriving Show 
data Node t = Word { wSurface :: !t } deriving Show 

parseTuba :: FilePath -> IO (Either String ([Sentence String])) 
parseTuba f = do r <- runX (readDocument [] f >>> process)
                 case r of
                   []   -> return $ Left "No parse result."
                   [pr] -> return $ Right pr
                   _    -> return $ Left "Ambiguous parse result!"

process :: (ArrowXml a) => a XmlTree ([Sentence String]) 
process = getChildren >>> listA (tag "sentence" >>> listA word >>> arr (\ns -> Sentence ns)) 

word :: (ArrowXml a) => a XmlTree (Node String) 
word = tag "word" >>> getAttrValue "form" >>> arr (\s -> Word s) 

-- | Gets the tag with the given name below the node. 
tag :: (ArrowXml a) => String -> a XmlTree XmlTree 
tag s = getChildren >>> isElem >>> hasName s

I am trying to read a corpus file, and its structure is obviously something like <corpus><sentence><word form="Hello"/><word form="world"/></sentence></corpus>.

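Even on the very small development corpus, the program needs about 15 seconds to read it in, of which around 20% is GC time (that is far too much).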

In particular, a lot of data spends far too much time in the DRAG state. This is the profile:

[Figure: monitoring DRAG culprits]

You can see that decodeDocument gets called a lot, and its data is then stalled until the very end of execution.

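Now, I think this should be easy to fix by folding all this decodeDocument data into my data structures (Sentence and Word), after which the runtime can forget about these thunks. As it stands, though, the folding happens only at the very end, when I force evaluation by deconstructing the Either in the IO monad, even though it could easily happen online. I see no reason for this, and my attempts to make the program strict have so far been in vain. I hope somebody can help me :-)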

I can't even think of many more places to put seqs and $!s in ...
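For illustration, the kind of strictness attempted above can be sketched as follows. This is only a sketch: the helpers forceString, forceNode, forceSentence and forceAll are hypothetical names that are not part of the original program, and it uses nothing beyond the Prelude. It fully evaluates the result list, but it does not change how the parser itself builds the tree, so it may well not remove the drag shown in the profile.

```haskell
-- Same data types as in the question.
data Sentence t = Sentence [Node t] deriving Show
data Node t = Word { wSurface :: !t } deriving Show

-- Hypothetical helper: walk a String to its end, forcing every character,
-- so that no unevaluated thunk remains inside it.
forceString :: String -> ()
forceString = foldr seq ()

-- Hypothetical helper: force the surface form of a single word.
forceNode :: Node String -> ()
forceNode (Word s) = forceString s

-- Hypothetical helper: force every word of a sentence.
forceSentence :: Sentence String -> ()
forceSentence (Sentence ns) = foldr (\n r -> forceNode n `seq` r) () ns

-- Hypothetical helper: return the sentences only after all of them
-- have been fully evaluated.
forceAll :: [Sentence String] -> [Sentence String]
forceAll ss = foldr (\s r -> forceSentence s `seq` r) () ss `seq` ss
```

Under these assumptions, the success branch of parseTuba could be written as [pr] -> return $! Right $! forceAll pr, so that the evaluation happens inside parseTuba rather than when main takes the Either apart.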

Source

2011-04-04 Aleksandar Dimitrov

One possible thing to try: the default HXT parser is strict, but there is a lazy parser based on tagsoup: http://hackage.haskell.org/package/hxt-tagsoup

I understand that expat can do lazy processing as well: http://hackage.haskell.org/package/hxt-expat

You may want to see whether switching the parsing backend, by itself, solves your problem.
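As a rough sketch of what switching the backend might look like with hxt-tagsoup: the function name wordForms is hypothetical, the element and attribute names are taken from the corpus example in the question, and validation is switched off on the assumption that the corpus file has no DTD. Whether this actually cures the space leak would still need to be checked with the profiler.

```haskell
import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup (withTagSoup)

-- Hypothetical reader: collect the "form" attribute of every <word>
-- element, parsing the file with the tagsoup backend instead of the
-- default HXT parser.
wordForms :: FilePath -> IO [String]
wordForms f = runX $
  readDocument [withValidate no, withTagSoup] f
    >>> deep (isElem >>> hasName "word")   -- every <word> element in the tree
    >>> getAttrValue "form"
```

In the original program the same change amounts to replacing readDocument [] f with readDocument [withValidate no, withTagSoup] f and leaving the process arrow as it is.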

Source

2011-04-04 14:26:43 sclv

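Well, I actually chose HXT over Hexpat because it lets me do rather flexible and well-defined XML parsing. I have written an online parser that uses polyParse on top of Hexpat. It does its job, but it is not as easy to extend and debug as HXT. I'll give hxt-tagsoup a try. – 2011-04-05 00:03:17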
