2016-08-18 60 views

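How do I correctly parse a text file with RegexParsers?

I want to parse the test data below. My parser handles three of the cases, so I think something is wrong with my regular expressions. When a line starts with # and also has a trailing comment that starts with #, parsing stops working. Can someone explain why? This is my solution so far: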

val testDate = 
    """ 
    |127.0.0.1 ads234.com 
    |#127.0.0.1 auto.search.msn.com # Microsoft uses this server to redirect 
    |#127.0.0.1 sitefinder.verisign.com # Verisign has joined the game 
    |#127.0.0.1 sitefinder-idn.verisign.com # of trying to hijack mistyped 
    |#127.0.0.1 s0.2mdn.net  # This may interfere with some streaming 
    |#127.0.0.1 ad.doubleclick.net # This may interfere with www.sears.com 
    |127.0.0.1 media.fastclick.net # Likewise, this may interfere with some 
    |127.0.0.1 cdn.fastclick.net 
    """.stripMargin 

I want to keep the leading # and the comment, if present.

object Example extends RegexParsers { 
    def comment: Parser[String] = """#.*""".r 
    def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r 
    def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r 
    def pound: Parser[String] = "#".r 
    def port: Parser[String] = """:\d{3}""".r 

    def urlPort = url | url <~ port 

    def pos1 = localhost ~ urlPort ^^ { 
    case _ ~ dns => LineParsed("", dns, "") 
    } 
    def pos2 = pound ~ localhost ~ urlPort ^^ { 
    case p ~ _ ~ dns => LineParsed(p, dns, "") 
    } 
    def pos3 = localhost ~ urlPort ~ comment ^^ { 
    case _ ~ dns ~ com => LineParsed("", dns, com) 
    } 
    def pos4 = pound ~ localhost ~ urlPort ~ comment ^^ { 
    case p ~ _ ~ dns ~ com => LineParsed(p, dns, com) 
    } 

    def linePos = pos1 | pos2 | pos3 | pos4 

    def fullLine = repsep(linePos, """\W*""".r) 
} 

I get the following exception:

#127.0.0.1 auto.search.msn.com # Microsoft uses this server to redirect 

           ^
    java.lang.RuntimeException: No result when parsing failed 

Answer


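There are several mistakes in the code. First, newlines count as whitespace by default, so the parser silently skips them, but you need to "see" them in order to separate the entries correctly. So you need to redefine whitespace: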

object Example extends RegexParsers { 
    override protected val whiteSpace: Regex = "[ \t]+".r 

The fullLine parser is then written as:

//allow several empty lines at the beginning and between entries 
    def fullLine = rep("\n") ~> repsep(linePos, rep1("\n")) 

(Another option is to split the input into lines beforehand and parse each line separately.)
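The pre-splitting alternative can be sketched without parser combinators at all. This is only an illustration under assumptions: `HostsLines`, `parseLine`, `parseHosts` and the three-field `LineParsed` are hypothetical names (the asker's real `LineParsed` is not shown), and the regex is deliberately looser than the `url` pattern in the question.

```scala
object HostsLines {
  // Hypothetical stand-in for the asker's LineParsed class, whose definition is not shown.
  final case class LineParsed(pound: String, dns: String, comment: String)

  // Optional leading '#', an IPv4-looking address, a host name, an optional trailing '# comment'.
  // Scala's regex extractor must match the whole line, so no ^ or $ anchors are needed.
  private val Entry =
    """\s*(#)?\s*(\d{1,3}(?:\.\d{1,3}){3})\s+([^\s#]+)\s*(#.*)?""".r

  // Parses one line; blank or unrecognised lines yield None.
  def parseLine(line: String): Option[LineParsed] = line match {
    case Entry(pound, _, dns, comment) =>
      // Groups that did not take part in the match come back as null.
      Some(LineParsed(
        Option(pound).getOrElse(""),
        dns,
        Option(comment).map(_.trim).getOrElse("")))
    case _ => None
  }

  // Splits the text into lines first, then keeps every line that parses.
  def parseHosts(text: String): List[LineParsed] =
    text.split("\n").toList.flatMap(line => parseLine(line).toList)
}
```

With this approach newlines never reach the line parser, so the default whitespace handling is no longer an issue; the price is that unparseable lines are dropped silently instead of reported.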

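The next mistake is the way you combine parsers with |. To parse A optionally followed by B, do not write A | A ~ B. Once A has been read, the parser will not try to read B, because the left-hand alternative has already succeeded. Write A ~ B.? instead: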

def urlPort = url <~ port.? // But anyway, you'll never have a port in a hosts file!
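The difference between the two forms can be checked on a toy grammar. This is a minimal sketch, assuming the scala-parser-combinators module is on the classpath; `OrderedChoiceDemo`, `host`, `shortFirst` and `withOptional` are hypothetical names, and the regexes are simplified stand-ins for the ones in the question.

```scala
import scala.util.parsing.combinator.RegexParsers

object OrderedChoiceDemo extends RegexParsers {
  def host: Parser[String] = """[a-z.]+""".r
  def port: Parser[String] = """:\d+""".r

  // Same shape as `url | url <~ port`: the left alternative succeeds on the
  // host alone, so the right alternative is never tried and the port is left over.
  def shortFirst: Parser[String] = host | host <~ port

  // Same shape as `url <~ port.?`: the port is consumed when present.
  def withOptional: Parser[String] = host <~ port.?
}
```

Here `parseAll(shortFirst, "example.com:8080")` fails because `:8080` remains unread, while `parseAll(withOptional, ...)` accepts the input both with and without the port.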

In the same way, the four cases pos1 | pos2 | pos3 | pos4 can be greatly simplified:

def linePos = pound.? ~ localhost ~ urlPort ~ comment.? ^^ { 
    case p ~ _ ~ dns ~ com ⇒ LineParsed(p.getOrElse(""), dns,com.getOrElse("")) 
    } 

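You can see here how the ? combinator gives you back an Option for p and com. I used getOrElse to fit the structure of LineParsed and keep the original behavior of your code, but a more Scala-ish approach would be to keep them as Options in the LineParsed class.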
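As a rough illustration of the Option-based variant, the sketch below uses a hypothetical `LineParsedOpt` class (the field names are assumptions, since the original `LineParsed` definition is not shown). With it, the parser action would simply become `case p ~ _ ~ dns ~ com => LineParsedOpt(p, dns, com)`.

```scala
object LineParsedOptDemo {
  // Hypothetical Option-based variant of LineParsed; field names are assumed.
  final case class LineParsedOpt(pound: Option[String], dns: String, comment: Option[String]) {
    // A leading '#' means the whole entry is commented out.
    def isCommentedOut: Boolean = pound.isDefined
  }

  // Callers pattern match on the Option instead of comparing against "".
  def describe(line: LineParsedOpt): String = {
    val state = if (line.isCommentedOut) "disabled" else "active"
    line.comment match {
      case Some(c) => s"$state ${line.dns} ($c)"
      case None    => s"$state ${line.dns}"
    }
  }
}
```

Keeping the Options makes "no comment" distinguishable from "empty comment" and lets the compiler remind callers to handle the missing case.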

Here is the final working code that parses your example:

import scala.util.matching.Regex 
import scala.util.parsing.combinator.RegexParsers 

object Example extends RegexParsers { 
    override protected val whiteSpace: Regex = "[ \t]+".r 
    def comment: Parser[String] = """#.*""".r 
    def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r 
    def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r 
    def pound: Parser[String] = "#".r 
    def port: Parser[String] = """:\d{3}""".r 
    def urlPort = url <~ port.? 

    def linePos = pound.? ~ localhost ~ urlPort ~ comment.? ^^ { 
    case p ~ _ ~ dns ~ com ⇒ LineParsed(p.getOrElse(""), dns, com.getOrElse("")) 
    } 

    def fullLine = rep("\n") ~> repsep(linePos, rep1("\n")) 
} 

Thanks for your input. I tried running the improved code, but now I get an exception saying: "[10.3] failure: string matching regex `\b(\d{1,3}\.){3}\d{1,3}\b' expected but end of source found". – User1232187


Use 'parse' instead of 'parseAll', or consume blank lines at the end of the fullLine parser. – thibr
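Both suggestions from this comment can be tried on a reduced grammar. This is a sketch under assumptions: it needs the scala-parser-combinators module on the classpath, and `TrailingNewlines`, `entry`, `entries` and `entriesLenient` are hypothetical names, with `entry` standing in for `linePos`.

```scala
import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers

object TrailingNewlines extends RegexParsers {
  // As in the answer: only spaces and tabs are skipped, newlines stay significant.
  override protected val whiteSpace: Regex = "[ \t]+".r

  // Simplified stand-in for linePos.
  def entry: Parser[String] = """[a-z0-9.-]+""".r

  // Same shape as fullLine: a newline after the last entry is left unconsumed.
  def entries: Parser[List[String]] = rep("\n") ~> repsep(entry, rep1("\n"))

  // Variant that also swallows trailing newlines, so parseAll can reach end of input.
  def entriesLenient: Parser[List[String]] = entries <~ rep("\n")
}
```

For the input `"a.com\nb.net\n"`, `parseAll(entries, ...)` fails because the final newline is never consumed, `parse(entries, ...)` succeeds and leaves that newline unread, and `parseAll(entriesLenient, ...)` succeeds with both entries.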