2012-12-21

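Using Jsoup to remove all HTML but keep line breaks

I have a String that contains the content of some emails, and I want to remove all HTML markup from this String.

This is my code at the moment: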

public static String html2text(String html) {

    // Parse the email HTML and keep only the tags allowed by the basic whitelist.
    Document document = Jsoup.parse(html);
    document = new Cleaner(Whitelist.basic()).clean(document);
    document.outputSettings().escapeMode(EscapeMode.xhtml);
    document.outputSettings().charset("UTF-8");
    html = document.body().html();

    // Strip the remaining <br /> tags.
    html = html.replaceAll("<br />", "");

    // Keep only the text from the salutation onwards.
    String[] splittedStr = html.split("Geachte heer/mevrouw,");

    html = splittedStr[1];

    html = "Geachte heer/mevrouw," + html;

    return html;
}

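This method removes all the HTML and keeps the line breaks and most of the layout. However, it also returns some &amp;nbsp; tags that are not completely removed. See the output below: as you can see, some of these tags, or even fragments of them, are still in the String. How can I get rid of them?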

 Loonheffingen       &amp;n= bsp; Naam 
nr         in administratie         &amp;nbs= p;           meldingen 
 nummer 

1          &amp;n= bsp;            = ;     0            &amp;= nbsp;           &amp;nbs= p;           1 
      123456789L01 

Edit:

<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">De afgekeurde meldingen zijn opgenomen in de bijlage: Afgekeurde meldingen.</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 

<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Wilt u zo spoedig mogelijk zorgdragen dat deze</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">meldingen gecorrigeerd worden aangeleverd?</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">mer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Volg &nbsp; &nbsp; Aantal verwerkt &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Aantal afgekeurde</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">&nbsp;Loonheffingen &nbsp; &nbsp; &nbsp; &nbsp; Naam</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">nr &nbsp; &nbsp; &nbsp; &nbsp; in administratie &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; meldingen</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">&nbsp;nummer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 
<br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"><span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> 

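This is part of the HTML I am trying to parse. I want to remove all the HTML but keep the layout of the original email.

Any help is appreciated,

Thanks!

Solved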

Document xmlDoc = Jsoup.parse(file, "", Parser.xmlParser());
Elements spans = xmlDoc.select("span");

for (Element link : spans) {
    String html = textPlus(link);
    System.out.println(html);
}


public static String textPlus(Element elem) {
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty()) {
        return "";
    }

    StringBuilder result = new StringBuilder();
    // start at the first text node
    Node currentNode = textNodes.get(0);
    while (currentNode != null) {
        // append deep text of all subsequent nodes
        if (currentNode instanceof TextNode) {
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
        } else if (currentNode instanceof Element) {
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString();
}

The code is based on an answer to this question.

Answers

1

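Instead of doing this, you need to walk the HTML structure that JSoup returns and collect the text nodes. That way you let JSoup decide what is really text, and it will handle the entity encoding for you (e.g. &amp; -> & and so on).

For more information, see this SO question.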
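As a rough illustration of that idea (a minimal sketch, not necessarily what the linked question does), the tree can be walked with a NodeVisitor: text nodes are appended with their entities already decoded, and each br element becomes a newline. The class name HtmlToPlainText and the method toPlainText are made up for this example. Two choices here are assumptions on my part: non-breaking spaces are turned into ordinary spaces so the column layout survives, and newlines in the HTML source are dropped so that only br elements produce line breaks.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, named for illustration only.
public class HtmlToPlainText {

    public static String toPlainText(String html) {
        final StringBuilder out = new StringBuilder();

        // Visit every node below <body> in document order.
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // getWholeText() returns the decoded text without
                    // collapsing whitespace, so runs of spaces are kept.
                    String text = ((TextNode) node).getWholeText();
                    out.append(text
                            .replace('\u00A0', ' ')  // assumption: nbsp becomes a plain space
                            .replace("\r", "")       // assumption: source newlines are not layout
                            .replace("\n", ""));
                } else if ("br".equals(node.nodeName())) {
                    // Each <br> in the email becomes one line break.
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<span>Volg &nbsp; &nbsp; Aantal verwerkt</span><br>\n"
                + "<span>&nbsp;nummer</span><br>";
        System.out.print(toPlainText(html));
    }
}
```

Because the text is read from the parsed tree rather than from re-serialized HTML, nothing has to be un-escaped by hand afterwards. If the email uses block elements such as p or div instead of br, the head method would need an extra branch for them.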

+0

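Thanks for your answer! One small problem: I don't know which elements I should search for. I tried to get all 'span' elements, but that didn't return anything. Have a look at my post; I edited it to include part of the HTML I want to parse. – Jef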