Get all XML from a raw text file?

I have a log file, and I need to write a program that extracts all the XML from it. The file looks like this:

text 
text 
xml 
text 
xml 
text 
etc

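Can you advise me on what would work better here: regular expressions or something else? Could this be done with dom4j?
If I try regular expressions, I can already see a problem: the text parts also contain <> tags.

Update 1: XML example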

SOAP message: 
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"> 
<soapenv:Body> 
here is body part of valid xml 
</soapenv:Body> 
</soapenv:Envelope> 
text,text,text,text 
symbols etc 
    SOAP message: 
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"> 
<soapenv:Body> 
here is body part of valid xml 
</soapenv:Body> 
</soapenv:Envelope> 
text,text,text,text 
symbols etc

Thanks.

Source

2012-11-26 Ishikawa Yoshi

^[a-zA-Z][a-zA-Z]{0,4}+[\n]*$ –

If each of these parts is on its own line, it should be fairly simple:

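// Remove every line whose first non-blank character is not '<'.
// The negative lookahead keeps indented XML lines and XML lines that follow a blank line.
s = s.replaceAll("(?m)^(?![ \\t]*<).*\\n?", "");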
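
As a quick check of the line-filter idea, here is a minimal sketch. It uses a variant of the pattern with a negative lookahead, so that indented XML lines and blank lines are handled predictably. The class name `LineFilterDemo`, the method `keepXmlLines`, and the sample input are made up for illustration, and the sketch assumes that lines end with `\n` and that each XML fragment fits on one line.

```java
public class LineFilterDemo {
    // Drops every line whose first non-blank character is not '<'.
    // Assumption: lines end with "\n" and each XML fragment fits on one line.
    static String keepXmlLines(String s) {
        return s.replaceAll("(?m)^(?![ \\t]*<).*\\n?", "");
    }

    public static void main(String[] args) {
        String log = "text\n"
                + "<foo>I am XML</foo>\n"
                + "\n"
                + "  <bar>me too</bar>\n"
                + "more text\n";
        // Prints only the two lines that start with '<' (after optional indentation).
        System.out.print(keepXmlLines(log));
    }
}
```

A text line that happens to start with `<` still survives this filter, so the result may need a real XML parser as a second pass.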

Source

2012-11-26 14:01:47

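The idea is to get all the XML, tags included, and ignore the rest of the text in the file. –

My code removes all text that is not XML, so...? –

Yes, I see, but what if the XML is not on separate lines? –

If your XML is always on a single line, you can iterate over the lines and check whether each one starts with <. If it does, try to parse the whole line as a DOM.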

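// Imports needed by the snippet below (place it in a method that declares `throws Exception`):
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

String xml = "hello\n"
        + "this is some text\n"
        + "<foo>I am XML</foo>\n"
        + "<bar>me too!</bar>\n"
        + "foo is bar\n"
        + "<this is not valid XML\n"
        + "<foo><bar>so am I</bar></foo>\n";
List<Document> docs = new ArrayList<Document>(); // the documents we can find
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
for (String line : xml.split("\n")) {
    if (line.startsWith("<")) {
        // Candidate line: keep it only if the parser accepts it as a document.
        try {
            ByteArrayInputStream bis = new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8));
            Document doc = docBuilder.parse(bis);
            docs.add(doc);
        } catch (Exception e) {
            System.out.println("Problem parsing line: `" + line + "` as XML");
        }
    } else {
        System.out.println("Discarding line: `" + line + "`");
    }
}
System.out.println("\nFound " + docs.size() + " XML documents.");
// Serialize each parsed document back to text, without the <?xml ...?> declaration.
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
for (Document doc : docs) {
    StringWriter sw = new StringWriter();
    transformer.transform(new DOMSource(doc), new StreamResult(sw));
    System.out.println(sw.toString());
}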

Output:

Discarding line: `hello` 
Discarding line: `this is some text` 
Discarding line: `foo is bar` 
Problem parsing line: `<this is not valid XML` as XML 

Found 3 XML documents. 
<foo>I am XML</foo> 
<bar>me too!</bar> 
<foo><bar>so am I</bar></foo>
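
For fragments that span several lines, such as the SOAP envelopes in the question's update, a line-by-line check is not enough. One possible approach, sketched here under stated assumptions, is to let a regular expression find candidate blocks and let the XML parser decide which candidates are well-formed. The sketch assumes every fragment of interest is a `soapenv:Envelope` element with exactly that prefix, as in the sample log. The class name `EnvelopeExtractor` and the method `extract` are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EnvelopeExtractor {
    // Assumption: each fragment starts with "<soapenv:Envelope" and ends with
    // "</soapenv:Envelope>". DOTALL lets '.' cross line breaks, and the
    // reluctant ".*?" stops at the first closing tag.
    private static final Pattern ENVELOPE = Pattern.compile(
            "<soapenv:Envelope\\b.*?</soapenv:Envelope>", Pattern.DOTALL);

    static List<String> extract(String log) {
        DocumentBuilder builder;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            builder = factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
        builder.setErrorHandler(new DefaultHandler()); // keep parse errors off stderr
        List<String> found = new ArrayList<>();
        Matcher m = ENVELOPE.matcher(log);
        while (m.find()) {
            String candidate = m.group();
            try {
                // Parsing is used only as a well-formedness check.
                builder.parse(new InputSource(new StringReader(candidate)));
                found.add(candidate);
            } catch (SAXException | IOException e) {
                // Not well-formed XML: skip this candidate.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String envelope = "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
                + "<soapenv:Body>\nhere is body part of valid xml\n</soapenv:Body>\n"
                + "</soapenv:Envelope>\n";
        String log = "SOAP message:\n" + envelope
                + "text,text,text,text\nsymbols etc\n"
                + "    SOAP message:\n" + envelope
                + "text,text,text,text\nsymbols etc\n";
        System.out.println("Found " + extract(log).size() + " envelope(s)");
    }
}
```

One limitation of this design: if an envelope is missing its closing tag, the match runs on to the next closing tag and swallows the following envelope, and the parser then rejects the combined block. For very large log files it may also be preferable to read the file in chunks rather than load it into a single string.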

Source

2012-11-26 13:40:30 Alex
