2012-04-17 106 views
Scala: parsing concatenated XML documents

So my question is almost identical to this previous StackOverflow question, but I'm asking it anyway because I don't like the accepted answer.

I have a file of concatenated XML documents:

<?xml version="1.0" encoding="UTF-8"?> 
<someData>...</someData> 
<?xml version="1.0" encoding="UTF-8"?> 
<someData>...</someData> 
... 
<?xml version="1.0" encoding="UTF-8"?> 
<someData>...</someData> 

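I want to parse out each one.

As far as I can tell, I can't use scala.xml.XML, since that depends on the one-document-per-file/string model.

Is there a subclass of Parser that I can use to parse XML documents out of an input source? Because then I could just do something like many1 xmldoc or some such.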
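
To make the limitation concrete, here is a minimal sketch that feeds concatenated input to a plain JDK SAX parser and reports where it gives up. It uses only `javax.xml.parsers` and `org.xml.sax`, so it does not depend on scala.xml; the names `SingleDocCheck` and `firstError` are made up for this illustration.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler

object SingleDocCheck {
  // Returns None if the input parses as a single document, otherwise the
  // (line, column) at which the SAX parser reported a fatal error.
  def firstError(input: String): Option[(Int, Int)] = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    try {
      parser.parse(new InputSource(new StringReader(input)), new DefaultHandler)
      None
    } catch {
      case e: SAXParseException => Some((e.getLineNumber, e.getColumnNumber))
    }
  }
}
```

A single document yields `None`; two documents back to back yield the position of the second `<?xml` declaration, which is exactly the information a recovery strategy can use.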

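This question is a duplicate unless you explain _why_ you don't like the other answer. Stating that there is no parser of the type you propose is not a sufficiently complete question/answer, IMO. – 2012-04-17 19:01:09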

@RexKerr: Fair point. I find the accepted answer unacceptable because "break on '<?xml'" smacks to me of [parsing XML with regexes](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), as does tag counting (because of the existence of '<![CDATA['). – rampion 2012-04-17 20:59:41

Answers

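OK, I came up with an answer I'm happier with.

Basically, I try to parse the XML using a SAXParser, just like scala.xml.XML.load does, but watch for SAXParseExceptions indicating that the parser encountered a <?xml in the wrong place.

Then I grab whatever root element has already been parsed, rewind the input just far enough, and restart the parse from there.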

// An input stream that can recover from a SAXParseException 
object ConcatenatedXML { 
    // A reader that can be rolled back to the location of an exception 
    class Relocator(val re : java.io.Reader) extends java.io.Reader {
        var marked = 0
        var firstLine : Int = 1
        var lineStarts : IndexedSeq[Int] = Vector(0)
        override def read(arr : Array[Char], off : Int, len : Int) = {
            // forget everything but the start of the last line in the
            // previously marked area
            val pos = lineStarts(lineStarts.length - 1) - marked
            firstLine += lineStarts.length - 1

            // read the next chunk of data into the given array
            re.mark(len)
            marked = re.read(arr,off,len)

            // find the line starts for the lines in the array
            lineStarts = pos +: (for (i <- 0 until marked if arr(i+off) == '\n') yield (i+1))

            marked
        }
        override def close() : Unit = { re.close() }
        override def markSupported = false
        // rewind to the last mark, then skip forward to (line, col + off)
        def relocate(line : Int, col : Int , off : Int) : Unit = {
            re.reset()
            val skip = lineStarts(line - firstLine) + col + off
            re.skip(skip)
            marked = 0
            firstLine = 1
            lineStarts = Vector(0)
        }
    }

    def parse(str : String) : List[scala.xml.Node] = parse(new java.io.StringReader(str)) 
    def parse(re : java.io.Reader) : List[scala.xml.Node] = parse(new Relocator(re)) 

    // parse all the concatenated XML docs out of a file 
    def parse(src : Relocator) : List[scala.xml.Node] = {
        val parser = javax.xml.parsers.SAXParserFactory.newInstance.newSAXParser
        val adapter = new scala.xml.parsing.NoBindingFactoryAdapter

        adapter.scopeStack.push(scala.xml.TopScope)
        try {

            // parse this, assuming it's the last XML doc in the string
            parser.parse(new org.xml.sax.InputSource(src), adapter)
            adapter.scopeStack.pop
            adapter.rootElem.asInstanceOf[scala.xml.Node] :: Nil

        } catch {
            case (e : org.xml.sax.SAXParseException) => {
                // rethrow unless this is a misplaced <?xml found after a
                // complete root element, i.e. the start of another xmldoc
                if (e.getMessage != """The processing instruction target matching "[xX][mM][lL]" is not allowed."""
                    || adapter.hStack.length != 1 || adapter.hStack(0) == null){
                    throw(e)
                }

                // tell the adapter we reached the end of a document
                adapter.endDocument

                // grab the current root node
                adapter.scopeStack.pop
                val node = adapter.rootElem.asInstanceOf[scala.xml.Node]

                // reset to the start of this doc
                src.relocate(e.getLineNumber, e.getColumnNumber, -6)

                // and parse the next doc
                node :: parse(src)
            }
        }
    }
} 

println(ConcatenatedXML.parse(new java.io.BufferedReader(
    new java.io.FileReader("temp.xml") 
))) 
println(ConcatenatedXML.parse(
    """|<?xml version="1.0" encoding="UTF-8"?> 
    |<firstDoc><inner><innerer><innermost></innermost></innerer></inner></firstDoc> 
    |<?xml version="1.0" encoding="UTF-8"?> 
    |<secondDoc></secondDoc> 
    |<?xml version="1.0" encoding="UTF-8"?> 
    |<thirdDoc>...</thirdDoc> 
    |<?xml version="1.0" encoding="UTF-8"?> 
    |<lastDoc>...</lastDoc>""".stripMargin 
)) 
try {
    ConcatenatedXML.parse(
        """|<?xml version="1.0" encoding="UTF-8"?>
           |<firstDoc>
           |<?xml version="1.0" encoding="UTF-8"?>
           |</firstDoc>""".stripMargin
    )
    throw(new Exception("That should have failed"))
} catch {
    // catch only the parser's exception, so the sentinel above still propagates
    case _ : org.xml.sax.SAXParseException => println("catches really incomplete docs")
}
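
For input that already fits in memory, the same idea (let the parser say where the next document starts, instead of splitting on a pattern up front) can be sketched without the rewinding reader. This is a simplified variant, not the code above: `SplitConcatenated`, `check`, `lineStart` and `split` are names invented for the sketch, it uses only the JDK SAX parser, it returns the text of each document rather than scala.xml nodes, and it re-parses each prefix to confirm it is well formed, trading speed for simplicity.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler

object SplitConcatenated {
  // Throws SAXParseException unless `doc` is one well-formed document.
  private def check(doc: String): Unit =
    SAXParserFactory.newInstance().newSAXParser()
      .parse(new InputSource(new StringReader(doc)), new DefaultHandler)

  // Character offset at which the given 1-based line starts in `s`.
  private def lineStart(s: String, line: Int): Int =
    (1 until line).foldLeft(0)((pos, _) => s.indexOf('\n', pos) + 1)

  // Split `input` into the texts of its concatenated documents.
  def split(input: String): List[String] =
    try { check(input); List(input) }
    catch {
      case e: SAXParseException =>
        // Approximate offset of the error, then back up to the "<?xml"
        // that the parser tripped over.
        val approx = lineStart(input, e.getLineNumber) + math.max(e.getColumnNumber - 1, 0)
        val cut = input.lastIndexOf("<?xml", approx)
        if (cut <= 0) throw e            // not a second declaration: real error
        val head = input.substring(0, cut)
        check(head)                      // the prefix must itself be well formed
        head :: split(input.substring(cut))
    }
}
```

Because every prefix is checked on its own, a `<?xml` inside a CDATA section is never used as a split point, and a declaration nested inside an unclosed element still fails with a SAXParseException. Each returned string can then be handed to whatever single-document loader is in use.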

If what you're concerned about is safety, you can wrap your chunks in unique tags:

def mkTag = "block"+util.Random.alphanumeric.take(20).mkString
val reader = io.Source.fromFile("my.xml")
def mkChunk(it: Iterator[String], chunks: Vector[String] = Vector.empty): Vector[String] = {
    // chunk: lines up to the next XML declaration; extra: that declaration and everything after it
    val (chunk,extra) = it.span(s => !(s.startsWith("<?xml") && s.endsWith("?>")))
    val tag = mkTag
    def tagMe = "<"+tag+">"+chunk.mkString+"</"+tag+">"
    if (!extra.hasNext) chunks :+ tagMe
    // drop the declaration line itself before recursing, otherwise span never advances
    else if (!chunk.hasNext) mkChunk(extra.drop(1), chunks)
    else mkChunk(extra.drop(1), chunks :+ tagMe)
}
val chunks = mkChunk(reader.getLines())
reader.close
val answers = xml.XML.loadString("<everything>"+chunks.mkString+"</everything>")
// Now take apart the resulting parse 

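Since you've supplied unique enclosing tags, you may get a parse error if someone has embedded literal XML tags somewhere, but you won't accidentally get the wrong number of parses.

(Warning: code typed but not checked at all. It gives the idea, not exactly correct behavior.)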
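
The "take apart the resulting parse" step could look roughly like the sketch below. It assumes the scala-xml module is on the classpath (it was bundled with older Scala releases and is a separate dependency in newer ones) and that each wrapper block holds exactly one document root; `TakeApart` and `documents` are hypothetical names, not part of the answer above.

```scala
import scala.xml.{Elem, XML}

object TakeApart {
  // Given the parsed <everything> element, return the root element of each
  // original document: the element children of each unique wrapper block.
  def documents(everything: Elem): Seq[Elem] =
    (everything \ "_" \ "_").collect { case e: Elem => e }
}
```

With the `answers` value from the code above, `TakeApart.documents(answers)` would give one `Elem` per original document, in file order.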