Remove HTML tags except a few specific ones from a String in Java

My input is a plain text string, and the requirement is to remove all HTML tags except a few specific ones, such as:

<p> 
<li> 
<u> 
<li>

If these specific tags have attributes such as class or id, I want to remove those attributes.

A few examples:

<a href = "#">Link</a>   -> Link 

<p>paragraph</p>     -> <p>paragraph</p> 

<p class="class1">paragraph</p> -> <p>paragraph</p>

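I have gone through "Remove HTML tags from a String", but it does not fully answer my question.

Can this be done with a set of regular expressions, or is there a library I can use?

Source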

2011-08-11 RandomQuestion

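How constrained is your HTML input? If it is arbitrary (X)HTML, then regular expressions alone may not be enough (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Things like CDATA blocks, comments, and script elements can throw off simple regular expressions.

Yes, it may contain CDATA blocks and JavaScript. I am ready to use a library. I am just wondering how JavaScript code can be told apart from plain text within a string. – RandomQuestion

I tried JSoup, and it seems able to handle all such cases. Here is the sample code.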
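To make the concern in the comments above concrete, here is a minimal sketch of how a naive tag-stripping regex can go wrong. The pattern `<[^>]*>` and the class name `NaiveRegexPitfall` are assumptions chosen for illustration; they do not come from the thread.

```java
// Hypothetical demo class: shows two inputs that defeat a naive "strip every tag" regex.
class NaiveRegexPitfall {

    // Naive approach: treat everything between '<' and the next '>' as a tag.
    static String stripAllTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        System.out.println(stripAllTags("<a title=\"a > b\">Link</a>"));
        // prints: b">Link   (with a leading space)

        // A CDATA section is not a tag, but the regex consumes its opening
        // together with the first inner tag and leaves the closing "]]>" behind.
        System.out.println(stripAllTags("<![CDATA[<sender>John Smith</sender>]]>"));
        // prints: John Smith]]>
    }
}
```

Cases like these are why a real HTML parser tends to be the safer choice once the input is not tightly controlled.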

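import org.apache.commons.lang3.StringEscapeUtils; // Apache Commons Lang 3 (org.apache.commons.lang in 2.x)
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // renamed to Safelist in newer jsoup releases

public String clean(String unsafe){ 
     // start from an empty whitelist (text only), then allow the tags to keep
     Whitelist whitelist = Whitelist.none(); 
     whitelist.addTags(new String[]{"p","br","ul"}); 

     // tags that are not whitelisted are dropped; no attributes are whitelisted, so they are dropped too
     String safe = Jsoup.clean(unsafe, whitelist); 
     // jsoup escapes entities in its output; turn them back into literal characters
     return StringEscapeUtils.unescapeXml(safe); 
}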

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I need.

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>

Source

2011-08-11 19:32:32 RandomQuestion

For simple HTML, this may be enough:

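// remove any <script> elements; (?s) lets '.' span line breaks inside the script body 
html = html.replaceAll("(?is)<script.*?</script>", ""); 
// this removes any attributes, keeping only the tag name 
html = html.replaceAll("(?i)<([a-zA-Z0-9_-]*)(\\s[^>]*)>", "<$1>"); 
// this removes any tags (not li and p); \\b keeps <pre> or <link> from being mistaken for <p> or <li> 
html = html.replaceAll("(?i)<(?!/?(?:li|p)\\b)[^>]*>", "");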

Hope that helps.

Source
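As a rough, self-contained sketch, the three regex steps described in this answer could be wrapped into one helper and tried against the examples from the question. The class name `SimpleTagStripper` and the method `strip` are hypothetical. The sketch also adds a word boundary after the tag name so that tags such as `<pre>` are not mistaken for `<p>`, and uses `(?s)` so a script body may span several lines. Like the answer itself, it is only meant for simple, well-behaved HTML.

```java
import java.util.regex.Pattern;

// Hypothetical helper: keeps <li> and <p> (without attributes), removes every other tag.
class SimpleTagStripper {

    // Whole <script>...</script> elements, case-insensitive, '.' also matches line breaks.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script.*?</script>");

    // An opening tag that carries attributes; group 1 is the tag name.
    private static final Pattern ATTRIBUTES =
            Pattern.compile("<([a-zA-Z0-9_-]+)\\s[^>]*>");

    // Any tag whose name is not exactly "li" or "p" (opening or closing).
    private static final Pattern OTHER_TAGS =
            Pattern.compile("(?i)<(?!/?(?:li|p)\\b)[^>]*>");

    static String strip(String html) {
        String result = SCRIPT.matcher(html).replaceAll("");
        result = ATTRIBUTES.matcher(result).replaceAll("<$1>");
        return OTHER_TAGS.matcher(result).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<a href = \"#\">Link</a>"));          // Link
        System.out.println(strip("<p>paragraph</p>"));                  // <p>paragraph</p>
        System.out.println(strip("<p class=\"class1\">paragraph</p>")); // <p>paragraph</p>
    }
}
```

Precompiling the patterns avoids recompiling them on every call, which `String.replaceAll` would otherwise do.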

2011-08-11 10:55:19 beny23

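Thanks beny. Yes, it works fine for simple HTML, but I tried Jsoup, and it seems to handle all of these cases well. – RandomQuestion

I will add my answer once the time limit for answering my own question is over. – RandomQuestion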
