2016-08-15 35 views

Parsing the first occurrence of a word that is not preceded by whitespace

In some .txt files I need to find the first occurrence of a word that is not preceded by whitespace. Here are the possible cases:

-- * should succeed
t1 = "hello\t999\nworld\t\900" 
t2 = "world\t\900\nhello\t999\n" 
t3 = "world world\t\900\nhello\t999\n" 

-- * should fail 
t4 = "world\t\900\nhello world\t999\n" 
t5 = "hello world\t999\nworld\t\900" 
t6 = "world hello\t999\nworld\t\900" 

Right now t6 succeeds even though it should fail, because my parser consumes any character at all until it reaches hello. Here is my parser:

My solution

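{-# LANGUAGE OverloadedStrings #-} -- needed so that literals such as "\t" can be passed to `string` as Text

import Control.Applicative 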

import Data.Attoparsec.Text.Lazy 
import Data.Attoparsec.Combinator 
import Data.Text hiding (foldr) 
import qualified Data.Text.Lazy as L (Text, pack) 



-- * should succeed
t1 = L.pack "hello\t999\nworld\t\900" 
t2 = L.pack "world\t\900\nhello\t999\n" 

-- * should fail 
t3 = L.pack "world\t\900\nhello world\t999\n" 
t4 = L.pack "hello world\t999\nworld\t\900" 
t5 = L.pack "world hello\t999\nworld\t\900" 

p = occur "hello"  

---- * discard all text until word `w` occurs, and find its only field `n` 
occur :: String -> Parser (String, Int) 
occur w = do 
    pUntil w 
    string . pack $ w 
    string "\t" 
    n <- natural 
    string "\n" 
    return (w, read n) 


-- * Parse a natural number 
natural :: Parser String 
natural = many1' digit 

-- * skip over all words in Text stream until the word we want 
pUntil :: String -> Parser String 
pUntil = manyTill anyChar . lookAhead . string . pack 

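A parser is *not* the right tool for "find the first occurrence of the sequence x in y". You should parse the whole string into a data structure that stores the (key, value) pairs along with the positions at which they occur. Your current problem is that `t6` contains two key/value pairs (one in the whole string, one in a suffix), so naturally a backtracking parser finds both. Parsing every key unconditionally solves this problem. With attoparsec you are limited to getting positions as byte indices, but that should be enough for your needs. – user2407038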
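One way to read this comment is sketched below. It is only an illustration under stated assumptions: it uses plain Prelude functions instead of attoparsec, records 1-based line numbers rather than byte indices as positions, and the names `parseLine`, `parsePairs`, and `lookupField` are hypothetical. Every line is split into a key and a value first, and only then is the wanted key looked up, so a key such as "world hello" can never be mistaken for "hello".

```haskell
import Data.Char (isDigit)

-- Split one line at its first tab into (key, value); Nothing if there is no tab.
parseLine :: String -> Maybe (String, String)
parseLine l = case break (== '\t') l of
  (k, '\t' : v) -> Just (k, v)
  _             -> Nothing

-- Parse every line unconditionally, keeping the 1-based line number as position.
-- Lines without a tab are skipped.
parsePairs :: String -> [(Int, String, String)]
parsePairs s =
  [ (i, k, v) | (i, l) <- zip [1 ..] (lines s), Just (k, v) <- [parseLine l] ]

-- Take the first line whose whole key equals w; succeed only if its value
-- consists of ASCII digits.
lookupField :: String -> String -> Maybe Int
lookupField w s =
  case [ v | (_, k, v) <- parsePairs s, k == w ] of
    (v : _) | not (null v), all isDigit v -> Just (read v)
    _                                     -> Nothing

main :: IO ()
main = do
  print (lookupField "hello" "world\t\900\nhello\t999\n")     -- t2: Just 999
  print (lookupField "hello" "world hello\t999\nworld\t\900") -- t6: Nothing
```

Unlike the parser in the question, this sketch does not require a newline after the number; that check would have to be added if it matters.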

Answers


Here is one approach to consider:

{-# LANGUAGE OverloadedStrings #-} 

import Control.Applicative 

import Data.Attoparsec.Text.Lazy 
import Data.Attoparsec.Combinator 
import Data.Text hiding (foldr) 
import qualified Data.Text.Lazy as L (Text, pack) 
import Data.Monoid 

natural = many1' digit 

-- manyTill anyChar (try $ char c <* eof) 

pair0 w = do 
    string (w <> "\t") 
    n <- natural 
    string "\n" 
    return n 

pair1 w = do 
    manyTill anyChar (try $ string ("\n" <> w <> "\t")) 
    n <- natural 
    string "\n" 
    return n 

pair w = pair0 w <|> pair1 w 

t1 = "hello\t999\nworld\t\900" 
t2 = "world\t\900\nhello\t999\n" 
t3 = "world world\t\900\nhello\t999\n" 

-- * should fail 
t4 = "world\t\900\nhello world\t999\n" 
t5 = "hello world\t999\nworld\t\900" 
t6 = "world hello\t999\nworld\t\900" 

test t = parseTest (pair "hello") (L.pack t) 

main = do 
    test t1; test t2; test t3 
    test t4; test t5; test t6 

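The idea is that pair0 matches a pair with the given key at the very beginning of the input, and pair1 matches such a pair immediately after a newline.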

The key is to use manyTill anyChar (try p), which keeps skipping characters until the parser p succeeds.
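As a minimal, self-contained sketch of that combinator on its own (assuming the strict Data.Attoparsec.Text API rather than the lazy one used above; `afterMarker` is a hypothetical helper name, not part of attoparsec):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, anyChar, decimal, manyTill, parseOnly, string, try)
import Data.Text (Text)

-- Skip characters one at a time until the literal marker matches,
-- then read the natural number that follows it.
afterMarker :: Text -> Parser Int
afterMarker marker = manyTill anyChar (try (string marker)) *> decimal

main :: IO ()
main = do
  -- The marker "\nhello\t" occurs, so this prints Right 999.
  print (parseOnly (afterMarker "\nhello\t") "world\t1\nhello\t999\n")
  -- Here hello follows a space, not a newline, so the marker never
  -- matches and the result is a Left value.
  print (parseOnly (afterMarker "\nhello\t") "world hello\t999\n")
```

Because the marker includes the leading newline, a hello that appears in the middle of a line is skipped like any other text. In attoparsec, try exists for Parsec compatibility and parsers already backtrack on failure, so it should not change the behaviour here.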

(By the way, I learned this use of manyTill and try from reading an answer written by @Cactus.)