2012-03-30 72 views

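Haskell: Why doesn't my parser backtrack correctly?

I decided to teach myself how to use Parsec, and I've hit a roadblock in the toy project I assigned myself.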

I'm trying to parse HTML, specifically:

<html> 
    <head> 
    <title>Insert Clever Title</title> 
    </head> 
    <body> 
    What don't you like? 
    <select id="some stuff"> 
     <option name="first" font="green">boilerplate</option> 
     <option selected name="second" font="blue">parsing HTML with regexes</option> 
     <option name="third" font="red">closing tags for option elements 
    </select> 
    That was short. 
    </body> 
</html> 

My code is:

{-# LANGUAGE FlexibleContexts, RankNTypes #-} 
module Main where 

import System.Environment (getArgs) 
import Data.Map hiding (null) 
import Text.Parsec hiding ((<|>), label, many, optional) 
import Text.Parsec.Token 
import Control.Applicative 

data HTML = Element { tag :: String, attributes :: Map String (Maybe String), children :: [HTML] } 
      | Text { contents :: String } 
    deriving (Show, Eq) 

type HTMLParser a = forall s u m. Stream s m Char => ParsecT s u m a 

htmlDoc :: HTMLParser HTML 
htmlDoc = do 
    spaces 
    doc <- html 
    spaces >> eof 
    return doc 

html :: HTMLParser HTML 
html = text <|> element 

text :: HTMLParser HTML 
text = Text <$> (many1 $ noneOf "<") 

label :: HTMLParser String 
label = many1 . oneOf $ ['a' .. 'z'] ++ ['A' .. 'Z'] 

value :: HTMLParser String 
value = between (char '"') (char '"') (many anyChar) <|> label 

attribute :: HTMLParser (String, Maybe String) 
attribute = (,) <$> label <*> (optionMaybe $ spaces >> char '=' >> spaces >> value) 

element :: HTMLParser HTML 
element = do 
    char '<' >> spaces 
    tag <- label 
    -- at least one space between each attribute and what was before 
    attributes <- fromList <$> many (space >> spaces >> attribute) 
    spaces >> char '>' 
    -- nested html 
    children <- many html 
    optional $ string "</" >> spaces >> string tag >> spaces >> char '>' 
    return $ Element tag attributes children 

main = do 
    source : _ <- getArgs 
    result <- parse htmlDoc source <$> readFile source 
    print result 

The problem seems to be that my parser doesn't like closing tags. As far as I can tell, it greedily assumes that < always starts an opening tag:

% HTMLParser temp.html 
Left "temp.html" (line 3, column 32): 
unexpected "/" 
expecting white space 

I've been playing with this for a while, and I'm not sure why it won't backtrack past that match.
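A reduced sketch of the same behaviour, separate from the code above: `openTag` is a hypothetical stand-in for `element`, `tags` is a made-up wrapper, and the input string is invented for illustration. Once `char '<'` has consumed input, the failure on `/` counts as a failure after consuming input, so `many` reports it instead of stopping cleanly and letting the closing tag be parsed.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for `element`: consumes '<', then requires letters.
openTag :: Parser String
openTag = char '<' *> many1 letter <* char '>'

-- Zero or more opening tags followed by a fixed closing tag.
tags :: Parser [String]
tags = many openTag <* string "</x>"

main :: IO ()
main =
  -- The third '<' is consumed by openTag before it fails on '/',
  -- so `many` fails as a whole and the result is a Left.
  print (parse tags "" "<a><b></x>")
```

The parse fails at the `/` that follows the third `<`, which is the same shape of error as in the output above.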

Parsec only backtracks on failure if you use `try`. – ehird 2012-03-30 18:08:01

Sometimes not even then -_-. Attoparsec is worse in this respect. – 2012-03-31 04:29:26

Answers

As ehird said, I need to use try:

attribute = (,) <$> label <*> (optionMaybe . try $ spaces >> char '=' >> spaces >> value) 
--... 
attributes <- fromList <$> many (try $ space >> spaces >> attribute) 
--... 
children <- many $ try html 
optional . try $ string "</" >> spaces >> string tag >> spaces >> char '>' 
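To see what `try` changes in isolation, here is a minimal sketch under stated assumptions: `openTag`, `closeTag`, `withoutTry`, and `withTry` are hypothetical names that do not appear in the code above, and the input is invented. `try p` behaves like `p`, except that when `p` fails it pretends no input was consumed, which lets `many` stop and hand the remaining input to the next parser.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical simplified tag parsers (no attributes, no nesting).
openTag :: Parser String
openTag = char '<' *> many1 letter <* char '>'

closeTag :: Parser String
closeTag = string "</" *> many1 letter <* char '>'

-- Without try: openTag consumes the '<' of "</a>" and then fails,
-- so `many` fails instead of stopping.
withoutTry :: Parser [String]
withoutTry = many openTag <* closeTag

-- With try: the failed attempt is rewound, `many` stops cleanly,
-- and closeTag sees the whole "</a>".
withTry :: Parser [String]
withTry = many (try openTag) <* closeTag

main :: IO ()
main = do
  print (parse withoutTry "" "<a><b></a>")  -- a Left (parse error)
  print (parse withTry    "" "<a><b></a>")  -- Right ["a","b"]
```

Wrapping a large parser in `try` can make error messages less precise, because a failure deep inside it is reported at the point where `try` started. A common alternative is to wrap only the prefix that is ambiguous.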