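How do I convert a Jsoup Document to a W3C Document?

I build a Jsoup document by parsing an in-house HTML page. How can I convert that Jsoup document to a W3C document?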

public Document newDocument(String path) throws IOException {
    // timeout(0) tells Jsoup to wait indefinitely for the response
    Document doc = Jsoup.connect(path).timeout(0).get();
    return new HtmlDocument<Document>(doc);
}

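While parsing, I want to convert the Jsoup document to an org.w3c.dom.Document. I am using an available library, DOMBuilder, for this, but the org.w3c.dom.Document I get back is null. I cannot understand the problem; I searched but could not find any answer.

Code that generates the W3C DOM document: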

Document jsoupDoc = factory.newDocument("http:localhost/testcases/test_2.html");
org.w3c.dom.Document docu = DOMBuilder.jsoup2DOM(jsoupDoc);

Can anyone please help me with this?

Source

2013-07-23 chaosguru

http://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/impl/DOMBuilder.java

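To retrieve a jsoup document via HTTP, call Jsoup.connect(...).get(). To load a jsoup document locally, call Jsoup.parse(new File("..."), "UTF-8").

Your call to DOMBuilder is correct.

When you say,

I am using an available library, DOMBuilder, for this, but while parsing I am getting org.w3c.dom.Document as null.

I think you mean, "I used an available library, DOMBuilder, but when printing the result, I get [#document: null]." At least, that is what I saw when I tried printing the w3cDoc object, but it does not mean the object is empty. I was able to traverse the document by calling getDocumentElement and getChildNodes.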

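import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public static void main(String[] args) {
    Document jsoupDoc;
    try {
        jsoupDoc = Jsoup.connect("http://stackoverflow.com/questions/17802445").get();
    } catch (IOException e) {
        e.printStackTrace();
        return; // nothing to convert if the fetch failed
    }

    // Convert with DOMBuilder, then walk the resulting W3C tree.
    org.w3c.dom.Document w3cDoc = DOMBuilder.jsoup2DOM(jsoupDoc);
    Element e = w3cDoc.getDocumentElement();
    NodeList childNodes = e.getChildNodes();
    Node n = childNodes.item(2);
    System.out.println(n.getNodeName());
}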
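
The [#document: null] output is easy to reproduce without Jsoup or DOMBuilder. The following is only a minimal sketch using the JDK's built-in XML parser; the class name DocumentToStringDemo, the parse helper, and the sample markup are made up for illustration. It shows that a Document node has a null node value (which is presumably what the printed "null" refers to) while the tree underneath is still populated.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DocumentToStringDemo {

    // Hypothetical helper: parses a small, well-formed XHTML string
    // with the JDK's built-in DOM parser (no Jsoup involved).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><head/><body><p>hello</p></body></html>");

        // Printing the Document node itself can look "empty";
        // the exact text depends on the DOM implementation in use.
        System.out.println(doc);

        // Per the DOM specification, a Document node has no node value.
        System.out.println(doc.getNodeValue());

        // The tree underneath is populated and can be traversed.
        Element root = doc.getDocumentElement();
        System.out.println(root.getNodeName());
        System.out.println(doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

To check whether a conversion really produced nothing, test getDocumentElement() for null rather than relying on what println shows for the Document itself.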

Source

2013-09-25 20:08:10

Alternatively, Jsoup provides the W3CDom class with the method fromJsoup. This method converts a Jsoup document into a W3C document.

Document jsoupDoc = ... 
W3CDom w3cDom = new W3CDom(); 
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);

UPDATE:

As of Jsoup 1.10.3, W3CDom is no longer experimental.
Up to Jsoup 1.10.2, the W3CDom class was still experimental.

Source

2015-05-15 11:44:03 Stephan
