Prevent Jsoup.parse from removing the closing </img> tag

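I'm parsing a piece of HTML with Jsoup.parse.

Everything else works great, but after parsing I need to pass this HTML to a PDF converter.

For some reason, Jsoup.parse removes the closing tag, and the PDF parser throws an exception about the missing closing img tag.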

Can't load the XML resource (using TRaX transformer). org.xml.sax.SAXParseException; 
lineNumber: 115; columnNumber: 4; The element 
type "img" must be terminated by the matching end-tag "</img>"

How can I prevent Jsoup.parse from removing the closing img tag?

For example, this line:

<img src="C:\path\to\image\image.png"></img>

becomes:

<img src="C:\path\to\image\image.png">

The same happens with:

<img src="C:\path\to\image\image.png"/>

Here is the code:

private void createPdf(File file, String content) throws IOException, DocumentException {
    // try-with-resources closes the stream even if rendering fails
    try (OutputStream os = new FileOutputStream(file)) {
        content = tidyUpHTML(content);
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(content);
        renderer.layout();
        renderer.createPDF(os);
    }
}

And here is the tidyUpHTML method that the method above calls:

private String tidyUpHTML(String html) { 
    org.jsoup.nodes.Document doc = Jsoup.parse(html); 
    doc.select("a").unwrap(); 
    String fixedTags = doc.toString().replace("<br>", "<br />"); 
    fixedTags = fixedTags.replace("<hr>", "<hr />"); 
    fixedTags = fixedTags.replaceAll("&nbsp;","&#160;"); 
    return fixedTags; 
}

Source

2016-12-08 Steve Waters

Could you please post your Jsoup parsing code, so we can understand why it removes the closing tag? – SachinSarawgi

@SachinSarawgi, updated.

Your PDF converter expects XHTML (that is why it expects the closing img tag). Set Jsoup to output XHTML (XML syntax) instead.

org.jsoup.nodes.Document doc = Jsoup.parse(html);
// Serialize as XML so that void elements such as <img> are written self-closed
doc.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.select("a").unwrap();
String fixedTags = doc.html();

See: Is it possible to convert HTML into XHTML with Jsoup 1.8.1?
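
To see why the renderer rejects the HTML-style output, you can check the same well-formedness rule with the XML parser that ships with the JDK. This is only a rough sketch: it uses neither Jsoup nor the PDF renderer, and the class name WellFormedCheck and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Jsoup or of the PDF renderer.
class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed, e.g. an unclosed <img>.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style void element: an XML parser treats it as an unclosed element.
        System.out.println(isWellFormedXml("<p><img src=\"image.png\"></p>"));
        // Self-closed form, which is what XML output syntax produces.
        System.out.println(isWellFormedXml("<p><img src=\"image.png\" /></p>"));
        // Explicit end tag, as in the original input.
        System.out.println(isWellFormedXml("<p><img src=\"image.png\"></img></p>"));
    }
}
```

Only the first form should be rejected. So the closing tag does not have to be preserved literally: output in which every element is closed, either way, should be enough for an XML-based renderer.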

Source

2016-12-08 13:38:01
