2012-11-26 92 views

Get all XML from a raw text file?

I have a log file, and I need to write a program that extracts all the XML from it. The file looks like this:

text 
text 
xml 
text 
xml 
text 
etc 

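Can you advise me on what is best to use here: regular expressions or something else? Could this be done with dom4j?
If I try regular expressions, I see one problem: the text parts also contain <> tags.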

Update 1: example XML

SOAP message: 
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"> 
<soapenv:Body> 
here is body part of valid xml 
</soapenv:Body> 
</soapenv:Envelope> 
text,text,text,text 
symbols etc 
    SOAP message: 
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"> 
<soapenv:Body> 
here is body part of valid xml 
</soapenv:Body> 
</soapenv:Envelope> 
text,text,text,text 
symbols etc 

Thanks.

^[A-ZA-Z][A-ZA-z]{0,4}+[\n]*$

Answers

If each such part is on its own line, it should be fairly simple:

// Drop every line whose first non-blank character is not '<'.
s = s.replaceAll("(?m)^(?![ \\t]*<).*\\n?", "");
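The line-filtering idea above can be sketched as a small runnable program. This is only an illustration: the class name KeepXmlLines, the method keepXmlLines and the sample log are hypothetical, and the pattern is a negative-lookahead variant chosen here (an assumption, not necessarily the answer's exact regex) so that indented XML lines and XML lines that follow a blank line are kept. Note that it also drops lines inside an element that do not start with '<', such as a plain-text body line, and it keeps any non-XML line that happens to start with '<'.

```java
import java.util.regex.Pattern;

public class KeepXmlLines {
    // Matches a whole line (plus its trailing newline, if any) whose first
    // non-blank character is not '<'. Blank lines match as well.
    private static final Pattern NON_XML_LINE =
            Pattern.compile("(?m)^(?![ \\t]*<).*\\n?");

    // Returns the input with every matching line removed.
    static String keepXmlLines(String log) {
        return NON_XML_LINE.matcher(log).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample input, loosely modelled on the log in the question.
        String log = "SOAP message:\n"
                + "<a>\n"
                + "  <b>body</b>\n"
                + "</a>\n"
                + "text,text\n";
        System.out.print(keepXmlLines(log));
    }
}
```

For the sample above, only the three lines that start with a tag should remain.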

The goal is to get all the XML, tags included, and discard the other text in the file.

My code removes all the text that is not XML, so...?

Yes, I understand, but what if the XML is not on separate lines?

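If your XML is always on a single line, you can iterate over the lines and check whether each one starts with <. If it does, try to parse the whole line as a DOM document.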

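import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlLines {
    public static void main(String[] args) throws Exception {
        String xml = "hello\n"
                + "this is some text\n"
                + "<foo>I am XML</foo>\n"
                + "<bar>me too!</bar>\n"
                + "foo is bar\n"
                + "<this is not valid XML\n"
                + "<foo><bar>so am I</bar></foo>\n";
        List<Document> docs = new ArrayList<Document>(); // the documents we can find
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        for (String line : xml.split("\n")) {
            if (line.startsWith("<")) {
                // Candidate line: keep it only if it parses as a complete document.
                try {
                    ByteArrayInputStream bis =
                            new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8));
                    Document doc = docBuilder.parse(bis);
                    docs.add(doc);
                } catch (Exception e) {
                    System.out.println("Problem parsing line: `" + line + "` as XML");
                }
            } else {
                System.out.println("Discarding line: `" + line + "`");
            }
        }
        System.out.println("\nFound " + docs.size() + " XML documents.");
        // Serialize each parsed document back to text, without the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        for (Document doc : docs) {
            StringWriter sw = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(sw));
            String docAsXml = sw.toString();
            System.out.println(docAsXml);
        }
    }
}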

Output:

Discarding line: `hello` 
Discarding line: `this is some text` 
Discarding line: `foo is bar` 
Problem parsing line: `<this is not valid XML` as XML 

Found 3 XML documents. 
<foo>I am XML</foo> 
<bar>me too!</bar> 
<foo><bar>so am I</bar></foo>
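For the case raised in the comments and shown in the question's update, where each XML fragment spans several lines, a per-line check is not enough. One possible sketch, assuming every fragment of interest is a soapenv:Envelope element as in the sample log, is to locate each candidate with a non-greedy regex from the opening to the closing Envelope tag and then keep only the candidates that a namespace-aware parser accepts. The class name EnvelopeExtractor and the method extractEnvelopes are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EnvelopeExtractor {
    // Assumption: the root element is always soapenv:Envelope and is never nested.
    // DOTALL lets '.' cross line breaks; '.*?' stops at the first closing tag.
    private static final Pattern ENVELOPE = Pattern.compile(
            "<soapenv:Envelope[\\s>].*?</soapenv:Envelope\\s*>", Pattern.DOTALL);

    static List<String> extractEnvelopes(String log) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler still rethrows fatal errors but does not print them.
        builder.setErrorHandler(new DefaultHandler());

        List<String> found = new ArrayList<>();
        Matcher m = ENVELOPE.matcher(log);
        while (m.find()) {
            String candidate = m.group();
            try {
                // Keep the candidate only if it is well-formed XML.
                builder.parse(new InputSource(new StringReader(candidate)));
                found.add(candidate);
            } catch (SAXException | IOException e) {
                // Not well-formed: treat it as ordinary log text and skip it.
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        String envelope =
                "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
                + "<soapenv:Body>\nhere is body part of valid xml\n</soapenv:Body>\n"
                + "</soapenv:Envelope>";
        String log = "SOAP message: \n" + envelope + "\ntext,text <b> text\n"
                + "    SOAP message: \n" + envelope + "\nsymbols etc\n";
        for (String xml : extractEnvelopes(log)) {
            System.out.println(xml);
            System.out.println("----");
        }
    }
}
```

Because the parser has the final say, stray <> characters in the surrounding text do not matter unless they happen to spell an Envelope tag. If the logs contain other root elements, the pattern would need to be adapted.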