2013-08-07 64 views

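Transforming XML with XSLT while preserving Unicode characters

My XSLT transformation had been working for months until I ran it on an XML file containing Unicode characters (most likely emoji). I need to preserve the Unicode, but the XSLT converts the characters to HTML entities. I thought setting the encoding to UTF-8 would fix this, but the problem persists.

Any help is appreciated. Code: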

private byte[] transform(InputStream stream) throws Exception{ 
    System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.processor.TransformerFactoryImpl"); 

    Transformer xmlTransformer; 

    xmlTransformer = (TransformerImpl) TransformerFactory.newInstance().newTransformer(new StreamSource(createXsltStylesheet())); 
    xmlTransformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); 

    XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(stream,"UTF-8"); 
    Source staxSource = new StAXSource(reader, true); 
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream(); 
    Writer writer = new OutputStreamWriter(outputStream, "UTF-8"); 
    xmlTransformer.transform(staxSource, new StreamResult(writer)); 


    return outputStream.toByteArray(); 
} 

If I add

xmlTransformer.setOutputProperty(OutputKeys.METHOD, "text"); 

the Unicode is preserved, but the XML is not.

Similar (but unfortunately still unanswered): http://stackoverflow.com/questions/15592025/transformer-setoutputpropertyoutputkeys-encoding-utf-8-is-not-working. This one looks better: http://stackoverflow.com/questions/443305/producing-valid-xml-with-java-and-utf-8-encoding – Tomalak

Answers

0

This line is suspicious:

stream = IOUtils.toInputStream(outputStream.toString(),"UTF-8"); 

You are converting the ByteArrayOutputStream to a String using the platform default encoding, which may not be UTF-8. Change it to

stream = IOUtils.toInputStream(outputStream.toString("UTF-8"),"UTF-8"); 

or, for better performance, just wrap the byte array in a ByteArrayInputStream:

return new ByteArrayInputStream(outputStream.toByteArray()); 
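
To make the difference between the two suggestions concrete, here is a minimal sketch. The class and method names (CharsetRoundTrip, roundTrip, asStream) are hypothetical and not part of the question's code; the sketch only shows decoding with an explicit charset versus handing the raw bytes on without decoding them at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Hypothetical helper: writes the text as UTF-8 bytes, then decodes the
    // buffer with an explicit charset instead of the platform default.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(text.getBytes(StandardCharsets.UTF_8));
        return out.toString("UTF-8");
    }

    // Hypothetical helper: passes the bytes on unchanged, so no charset
    // is involved and nothing can be decoded incorrectly.
    static InputStream asStream(ByteArrayOutputStream out) {
        return new ByteArrayInputStream(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        String original = "Fun \uD83D\uDC4D"; // emoji written as a surrogate pair
        System.out.println(roundTrip(original).equals(original));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(original.getBytes(StandardCharsets.UTF_8));
        System.out.println(asStream(out).available() + " bytes available");
    }
}
```

Wrapping the bytes avoids one decode and one encode step, which is why the answer calls it the better-performing option.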

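Thanks for your comment. That line actually comes after the problem. The emoji are already altered when I call the transformer. I have updated my code to reflect my latest changes. – l15a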

0

Try using the Apache serializer to convert the XML to a string.

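import java.io.StringWriter;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;

// Serialize the DOM (doc is an existing org.w3c.dom.Document)
OutputFormat format = new OutputFormat(doc);
// as a String
StringWriter stringOut = new StringWriter();
XMLSerializer serial = new XMLSerializer(stringOut, format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());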
0

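I just ran into this same problem, and after spending far too long researching it, here is my conclusion.

The Java XSLT processor escapes multi-byte UTF-8 characters as HTML entities even when the output method is XML ... if the multi-byte character appears in a text() node that is not wrapped in CDATA. If the character is wrapped in CDATA (for output), the multi-byte character is preserved.

My problem:

I have an XML file that looks like this, complete with emoji.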
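
The two numbers that appear in the escaped output quoted further down in this answer (55357 and 56397) are the two halves of a UTF-16 surrogate pair, which suggests the serializer escaped each half separately rather than the character as a whole. A small sketch to check that arithmetic; the class and method names are made up for illustration:

```java
public class SurrogateDemo {
    // Combines two UTF-16 code units, given as the decimal values seen in
    // the escaped output, into the single code point they encode.
    static int combine(int high, int low) {
        char h = (char) high;
        char l = (char) low;
        if (!Character.isSurrogatePair(h, l)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(h, l);
    }

    public static void main(String[] args) {
        int codePoint = combine(55357, 56397);
        // Hexadecimal form of the code point, then the single reference
        // that would be valid in XML for the same character.
        System.out.println("U+" + Integer.toHexString(codePoint).toUpperCase());
        System.out.println("&#" + codePoint + ";");
    }
}
```

The pair decodes to U+1F44D, so a correct serializer would emit either the raw character or the single reference &#128077;.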

<events> 
    <event> 
     <id>RANDOMID</id> 
     <blah> 
      <blahId>FOOONE</blahId> 
     </blah> 
     <blah> 
      <blahId>FOOTWO</blahId> 
     </blah> 
     <eventComment>Did some things. Had some Fun. </eventComment> 
    </event> 
</events> 

I started with an XSL stylesheet like this:

<xsl:stylesheet version="1.0" 
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
       xmlns="http://www.w3.org/TR/xhtml1/strict" 
> 
    <xsl:output method = "xml" version="1.0" encoding = "UTF-8" omit-xml-declaration="no" indent="yes" /> 

    <xsl:template match="/"> 
     <events> 
      <xsl:for-each select="/events/event"> 
       <event> 
        <xsl:copy-of select="./*[name() != 'blah']"/> 
        <xsl:for-each select="./blah"> 
         <blahId><xsl:copy-of select="./blahId/text()"/></blahId> 
        </xsl:for-each> 
       </event> 
      </xsl:for-each> 
     </events> 
    </xsl:template> 
</xsl:stylesheet> 

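Running this through the Java transformer always produced &#55357;&#56397; where my emoji should be. Subsequent attempts to parse the resulting document failed with the following exception message: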

org.xml.sax.SAXParseException; lineNumber: y; columnNumber: x; Character reference "&#55357" is an invalid XML character. 

HOGWASH!
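
The parser's complaint can be reproduced in isolation. The sketch below assumes the JDK's default DOM parser and uses a hypothetical helper named parses; it feeds the parser the same two surrogate references, and then a single reference to the whole code point for comparison.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CharRefParseDemo {
    // Returns true if the default DOM parser accepts the XML text,
    // false if it rejects it with a SAXException.
    static boolean parses(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Each half of the surrogate pair escaped on its own: not well-formed.
        System.out.println(parses("<c>&#55357;&#56397;</c>"));
        // One reference to the full code point U+1F44D: well-formed.
        System.out.println(parses("<c>&#128077;</c>"));
    }
}
```

The first document is rejected because a character reference must name a legal XML character, and a lone surrogate is not one. The parser is therefore behaving correctly; the fault lies in the output that was produced for it.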

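Testing with xsltproc on the command line was no help, because xsltproc is not stupid about multi-byte characters. I got the results I expected.

A solution

Naming eventComment in the cdata-section-elements attribute of the xsl:output tag wraps its content in CDATA, which preserves the bytes. This works with both xsltproc and the Java transformer.

The magic here is the cdata-section-elements attribute of the <xsl:output> tag. I updated my XSL template to this: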

<xsl:stylesheet version="1.0" 
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
       xmlns="http://www.w3.org/TR/xhtml1/strict" 
> 
    <xsl:output cdata-section-elements="eventComment" method="xml" version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/> 

    <xsl:template match="/"> 
     <events> 
      <xsl:for-each select="/events/event"> 
       <event> 
        <xsl:copy-of select="./*[name() != 'blah' and name() != 'eventComment']"/> 
        <!-- For the cdata-section-elements to resolve that eventComment needs to be preserved as CDATA 
         (so we don't get java doing stupid things with unicode escapment) 
         it needs to be explicitly referenced here. 
        --> 
        <eventComment><xsl:copy-of select="./eventComment/text()"/></eventComment> 
        <xsl:for-each select="./blah"> 
         <blahId><xsl:copy-of select="./blahId/text()"/></blahId> 
        </xsl:for-each> 
       </event> 
      </xsl:for-each> 
     </events> 
    </xsl:template> 
</xsl:stylesheet> 

Now my output from both xsltproc and the Java transformer looks like this, and it parses happily with Java DocumentBuilders. The attribute is described at https://www.w3.org/TR/xslt#output

<?xml version="1.0" encoding="UTF-8"?> 
<events xmlns="http://www.w3.org/TR/xhtml1/strict"> 
    <event> 
    <id xmlns="">RANDOMID</id> 
    <eventComment><![CDATA[Did some things. Had some Fun. ]]></eventComment> 
    <blahId>FOOONE</blahId> 
    <blahId>FOOTWO</blahId> 
    </event> 
</events>
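
For anyone who would rather not touch the stylesheet, the same output setting can also be supplied from Java through OutputKeys.CDATA_SECTION_ELEMENTS. The sketch below is only an illustration under some assumptions: it uses an identity transform and the JDK's built-in processor rather than the Xalan factory from the question, and the class and method names are hypothetical. How emoji outside CDATA are serialized may differ between processor versions, so it only demonstrates the CDATA wrapping itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CdataOutputDemo {
    // Copies the XML unchanged (identity transform) while asking the
    // serializer to wrap the text of the named elements in CDATA sections.
    static String copyWithCdata(String xml, String cdataElements) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, cdataElements);

        // Write to a byte stream so the transformer applies the encoding itself.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<events><event><eventComment>Had some Fun. \uD83D\uDC4D"
                + "</eventComment></event></events>";
        System.out.println(copyWithCdata(xml, "eventComment"));
    }
}
```

When a stylesheet is in play, its own xsl:output attributes and any properties set on the Transformer are combined, with the explicitly set properties taking precedence, so either place should work.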