How to change '&nbsp;' to '&#160;' in HTML using JSoup

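I am using JSoup to parse an HTML file and remove elements that are invalid in XML, because I need to apply XSLT to the file. The problem I am running into is that "&nbsp;" entities exist in my document. I need to change them to the numeric character reference '&#160;' so that I can run XSLT on the file.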

So I want:

<p> &nbsp; </p> 
<p> &nbsp; </p> 
<p> &nbsp; </p> 
<p> &nbsp; </p>

to become:

<p> &#160; </p> 
<p> &#160; </p> 
<p> &#160; </p> 
<p> &#160; </p>

I tried a text replacement, but it did not work. The body of performConversion:

Elements els = doc.body().getAllElements(); 
for (Element e : els) { 
    List<TextNode> tnList = e.textNodes(); 
    for (TextNode tn : tnList){ 
        String orig = tn.text();
        tn.text(orig.replaceAll("&nbsp;","&#160;"));
    } 
}

Code:

File f = new File ("C:/Users/jrothst/Desktop/Test File.htm"); 

Document doc = Jsoup.parse(f, "UTF-8"); 
doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml); 
System.out.println("Starting parse.."); 
performConversion(doc); 

String html = doc.toString(); 
System.out.println(html); 
FileUtils.writeStringToFile(f, doc.outerHtml(), "UTF-8");

How can I make these changes using the JSoup library?

2016-07-26 Justin

The following works for me. You don't need to do any manual search and replace:

File f = new File ("C:/Users/seanbright/Desktop/Test File.htm"); 

Document doc = Jsoup.parse(f, "UTF-8"); 
doc.outputSettings() 
    .syntax(Document.OutputSettings.Syntax.xml) 
    .escapeMode(Entities.EscapeMode.xhtml); 

System.out.println(doc.toString());

Input:

<html><head></head><body>&nbsp;</body></html>

Output:

<html><head></head><body>&#xa0;</body></html>

(&#xa0; is the same thing as &#160;, only in hexadecimal rather than decimal.)
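
An XML parser treats the hexadecimal and decimal forms of a character reference identically, so the output above should normally be usable for XSLT as it is. If some downstream tool really does insist on the decimal form, one option is to post-process the serialized string. The following is only a rough sketch using the standard java.util.regex classes; the class name NumericRefs and the method hexToDecimalRefs are made up for illustration and are not part of JSoup. It rewrites every match in the string, including any that happen to sit inside comments or CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSoup: converts hexadecimal numeric
// character references (e.g. &#xa0;) into their decimal form (e.g. &#160;).
public class NumericRefs {
    // One to six hex digits is enough for any Unicode code point
    // and keeps Integer.parseInt from overflowing.
    private static final Pattern HEX_REF =
            Pattern.compile("&#[xX]([0-9a-fA-F]{1,6});");

    public static String hexToDecimalRefs(String markup) {
        Matcher m = HEX_REF.matcher(markup);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the captured hex digits and emit the decimal reference.
            int codePoint = Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, "&#" + codePoint + ";");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "<p> &#160; </p>"
        System.out.println(hexToDecimalRefs("<p> &#xa0; </p>"));
    }
}
```

With the answer's code, this would be applied to the result of doc.toString() before writing the file.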

2016-07-26 18:06:00

Great answer, much simpler than find and replace. Thanks! – Justin
