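I have a String that contains the content of some emails, and I want to remove all HTML markup from it. I would like to use Jsoup to remove all the HTML but keep the line breaks.
This is my code at the moment: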
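import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public static String html2text(String html) {
    Document document = Jsoup.parse(html);
    // Keep only the tags allowed by the basic whitelist
    document = new Cleaner(Whitelist.basic()).clean(document);
    document.outputSettings().escapeMode(EscapeMode.xhtml);
    document.outputSettings().charset("UTF-8");
    html = document.body().html();
    html = html.replaceAll("<br />", "");
    // Drop everything before the salutation, then put the salutation back
    String[] splittedStr = html.split("Geachte heer/mevrouw,");
    html = splittedStr[1];
    html = "Geachte heer/mevrouw," + html;
    return html;
}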
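This method removes all the HTML while keeping the line breaks and most of the layout. But it also returns some & and nbsp; fragments that are not removed completely. See the output below: as you can see, the String still contains some of these entities, or even just parts of them. How can I get rid of them?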
Loonheffingen &n= bsp; Naam
nr in administratie &nbs= p; meldingen
nummer
1 &n= bsp; = ; 0 &= nbsp; &nbs= p; 1
123456789L01
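The stray `=` characters in this output look like quoted-printable soft line breaks (an `=` at the end of an encoded line), which would explain why an entity such as `&nbsp;` ends up split into `&n= bsp;` and is no longer recognized. That is an assumption about how the email was stored, not something the output proves. Under that assumption, a minimal sketch that joins the split lines before the text is handed to Jsoup could look like this; the class and method names are hypothetical, and it does not decode `=XX` hex escapes, which a full quoted-printable decoder would handle:

```java
import java.util.regex.Pattern;

public class SoftBreakRepair {
    // An "=" immediately followed by a line break is a quoted-printable soft line break.
    private static final Pattern SOFT_BREAK = Pattern.compile("=\\r?\\n");

    /** Joins lines that were split by quoted-printable soft line breaks. */
    public static String removeSoftLineBreaks(String raw) {
        return SOFT_BREAK.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical raw input: the entity is cut in two by a soft line break.
        String raw = "Loonheffingen &n=\r\nbsp; Naam";
        System.out.println(removeSoftLineBreaks(raw));
    }
}
```

Once the entities are whole again, the HTML parser can decode them like any other `&nbsp;`.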
Edit:
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">De afgekeurde meldingen zijn opgenomen in de bijlage: Afgekeurde meldingen.</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Wilt u zo spoedig mogelijk zorgdragen dat deze</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">meldingen gecorrigeerd worden aangeleverd?</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">mer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Volg Aantal verwerkt Aantal afgekeurde</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> Loonheffingen Naam</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">nr in administratie meldingen</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> nummer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"><span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">1 0 1</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
This is part of the HTML I am trying to parse. I want to remove all the HTML but keep the layout of the original email.
Any help is appreciated,
Thanks!
Solution
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// For this overload, `file` must hold the HTML content as a String
Document xmlDoc = Jsoup.parse(file, "", Parser.xmlParser());
Elements spans = xmlDoc.select("span");
for (Element span : spans) {
    String html = textPlus(span);
    System.out.println(html);
}

public static String textPlus(Element elem) {
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty()) {
        return "";
    }
    StringBuilder result = new StringBuilder();
    // start at the first text node
    Node currentNode = textNodes.get(0);
    while (currentNode != null) {
        // append deep text of all subsequent nodes
        if (currentNode instanceof TextNode) {
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
        } else if (currentNode instanceof Element) {
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString();
}
The code is based on an answer to this question.
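Since the goal is to keep the email's line layout, another option (not taken from the linked answer, and only a rough sketch) is to turn each `<br>` into a newline before dropping the remaining tags. The version below uses only JDK regular expressions so that it runs without extra dependencies. The class name `BrToNewline` and the method `toPlainText` are hypothetical, it only decodes `&nbsp;` and `&amp;`, and a real parser such as Jsoup will cope with malformed email HTML far more robustly:

```java
import java.util.regex.Pattern;

public class BrToNewline {
    // A <br> tag with optional attributes, plus one source line break directly after it.
    private static final Pattern BR = Pattern.compile("(?i)<br\\b[^>]*>(\\r?\\n)?");
    // Any other tag; this is a crude match and not a real HTML parser.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Converts <br> to newlines, strips other tags, and decodes two common entities. */
    public static String toPlainText(String html) {
        String text = BR.matcher(html).replaceAll("\n");
        text = TAG.matcher(text).replaceAll("");
        // Decode &amp; last so that "&amp;nbsp;" is not decoded twice.
        return text.replace("&nbsp;", " ").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<span style=\"color:rgb(34,34,34)\">nr in administratie&nbsp;meldingen</span>"
                + "<br style=\"color:rgb(34,34,34)\">\n"
                + "<span> nummer</span><br>";
        System.out.print(toPlainText(html));
    }
}
```

Leading spaces inside each `span` are left untouched, so column-style alignment in the original email survives as far as the source HTML preserved it.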
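Thanks for your answer! One small problem: I don't know which elements I should search for. I tried to get all the 'span' elements, but it didn't return anything. Take a look at my post; I edited it to include part of the HTML I want to parse. – Jef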