Remove all HTML from XML

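I want to feed some XML to Apache Solr, but some of the XML contains HTML-formatted text that prevents me from posting it to my Solr server. Obviously, it would be nice to keep that information, since my documents may be pre-formatted before they are posted, but I haven't seen, and am not aware of, whether escaping would avoid Solr's problems with the HTML. My question is: how do I use XSLT to strip the HTML out of the XML?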

For example:

What I have: 

<field name="description"><h1>This is a description of a doc!</h1><p> This doc contains some information</p></field> 

What I need: 

<field name="description">This is a description of a doc! This doc contains some information.</field>

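I'd like a smart fix rather than a blacklist of specific tags to scrub during the XSL transformation. A blacklist would be inefficient: if someone decides to create a document with, say, a new tag, the blacklist won't catch it unless a programmer adds the tag by hand.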

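I have tried converting the HTML tags to HTML entities (&lt; and &gt; for < and >), but that screws things up further down the line when I try to post this via HttpPost using BasicNameValuePairs. I don't want to use these entities.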

Any ideas, StackOverflow?

2013-07-26 bneigher

If you know which element contains the HTML, you can match any node that is a descendant of that element and simply do an apply-templates.

Example...

XML Input

<field name="description"><h1>This is a <b>description</b> of a doc!</h1><!--Here's a comment--><p> This doc contains some information</p></field>

XSLT 1.0

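<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <!-- Identity template: copy every node and attribute unchanged. -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- Inside <field>, drop every non-text node (elements, comments,
         processing instructions) but keep processing its children,
         so only the text survives. -->
    <xsl:template match="node()[ancestor::field and not(self::text())]">
        <xsl:apply-templates/>
    </xsl:template>

</xsl:stylesheet>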

XML Output

<field name="description">This is a description of a doc! This doc contains some information</field>
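If you want to try this from Java (the question mentions BasicNameValuePairs, so I'm assuming a Java client), here is a minimal sketch that runs the same two templates through the JDK's built-in javax.xml.transform API. The class name StripFieldMarkup and the strip helper are made up for illustration, and the stylesheet is inlined as a string without the xsl:output line; treat it as a starting point rather than a finished implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of Solr or any library.
class StripFieldMarkup {

    // The same two templates as the stylesheet above, inlined for the demo.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"node()|@*\">"
      + "<xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"node()[ancestor::field and not(self::text())]\">"
      + "<xsl:apply-templates/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the result as a string.
    public static String strip(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        // Leave out the <?xml ...?> declaration so the result is easy to embed.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String input =
            "<field name=\"description\"><h1>This is a <b>description</b> of a doc!</h1>"
          + "<!--Here's a comment--><p> This doc contains some information</p></field>";
        System.out.println(strip(input));
    }
}
```

Because the second template matches every non-text node under field, no list of tag names has to be maintained: a tag nobody anticipated is stripped the same way as h1 or p.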

2013-07-26 06:10:49

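However, this won't get rid of HTML comments, so it isn't as smart as I'd like. See? – bneigher

@BenjaminNeigher - You can change the match to 'node()[ancestor::field and not(self::text())]'.

@BenjaminNeigher - I updated my example to show that comments are removed as well.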
