2012-10-02 37 views
5

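I am in the middle of a very painful process of converting Word-based documents to XML. I have run into the following problem, which involves mixed content and string-manipulation cleanup: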

<?xml version="1.0" encoding="UTF-8"?> 
<root> 
    <p> 
     <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a 
      quote</hi>?」 (Source). </p> 

    <p> 
     <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a 
      quote</hi>」 (Source). </p> 

    <p> 
     <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is 
      definitely a quote</hi>!」 (Source). </p> 

    <p> 
     <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a 
      first quote</hi>」 (Source). „<hi rend="italics">Sometimes there is a second quote as 
      well</hi>!?」 (Source). </p> 

</root> 

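The <p> nodes have mixed content. The <element> I have already taken care of in a previous iteration. The problem now is with the quotes and sources, which appear partly inside <hi rend="italics"/> and partly as text nodes.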

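How can I use XSLT 2.0 to:

  1. match all <hi rend="italics"> nodes that are immediately preceded by a text node whose last character is „?
  2. output the content of <hi rend="italics"> as <quote>...</quote>, dropping the quotation marks („ and 」) but including within <quote/> any question and exclamation marks that appear immediately after the <hi rend="italics"> sibling?
  3. convert the text between "(" and ")" that follows the <hi rend="italics"> node to <source>...</source>, without the brackets?
  4. keep the final full stop?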

In other words, my output should look like this:

<root> 
<p> 
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>. 
</p> 

<p> 
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</quote> <source>Source</source>. 
</p> 

<p> 
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</quote> <source>Source</source>. 
</p> 

<p> 
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>. 
</p> 

</root> 

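I have never dealt with mixed content and string manipulation like this, and the whole thing is really throwing me. I would be very grateful for any tips.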


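In the input document the question and exclamation marks are outside the 'hi' element, but in the desired output they are inside the 'quote' element. That looks odd. Is it right? Please confirm. –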


That is the intent, yes. – Tench

Answers

1

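Here is an alternative solution. It allows for input documents in a more narrative style: quotes within quotes, multiple (Source) fragments within one text node, and '„' treated as data when it is not followed by a hi element.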

<xsl:stylesheet version="2.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:so="http://stackoverflow.com/questions/12690177" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema" 
    exclude-result-prefixes="xsl xs so"> 
<xsl:output omit-xml-declaration="yes" indent="yes" /> 
<xsl:strip-space elements="*" /> 

<xsl:template match="@*|comment()|processing-instruction()"> 
    <xsl:copy /> 
</xsl:template> 

<xsl:template match="*"> 
    <xsl:copy> 
    <xsl:apply-templates select="@*|node()" /> 
    </xsl:copy> 
</xsl:template> 

<xsl:function name="so:clip-start" as="xs:string"> 
    <xsl:param name="in-text" as="xs:string" /> 
    <xsl:value-of select="substring($in-text,1,string-length($in-text)-1)" /> 
</xsl:function> 

<xsl:function name="so:clip-end" as="xs:string"> 
    <xsl:param name="in-text" as="xs:string" /> 
    <xsl:value-of select="substring-after($in-text,'」')" /> 
</xsl:function> 

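<!-- True when the text node ends with the opening mark and the node
     directly after it is an italic hi element. -->
<xsl:function name="so:matches-start" as="xs:boolean"> 
    <xsl:param name="text-node" as="text()" /> 
    <xsl:sequence select="$text-node/following-sibling::node()[1]/self::hi[@rend='italics'] and 
         ends-with($text-node, '„')" /> 
</xsl:function> 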

<xsl:template match="text()[so:matches-start(.)]" priority="2"> 
    <xsl:call-template name="parse-text"> 
    <xsl:with-param name="text" select="so:clip-start(.)" /> 
    </xsl:call-template> 
</xsl:template> 

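<!-- True when the node directly before the text node is an italic hi element
     and the text starts with optional ! or ? followed by the closing mark. -->
<xsl:function name="so:matches-end" as="xs:boolean"> 
    <xsl:param name="text-node" as="text()" /> 
    <xsl:sequence select="$text-node/preceding-sibling::node()[1]/self::hi[@rend='italics'] and 
         matches($text-node,'^[!?]*」')" /> 
</xsl:function> 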

<xsl:template match="text()[so:matches-end(.)]" priority="2"> 
    <xsl:call-template name="parse-text"> 
    <xsl:with-param name="text" select="so:clip-end(.)" /> 
    </xsl:call-template> 
</xsl:template> 

<xsl:template match="text()[so:matches-start(.)][so:matches-end(.)]" priority="3"> 
    <xsl:call-template name="parse-text"> 
    <xsl:with-param name="text" select="so:clip-end(so:clip-start(.))" /> 
    </xsl:call-template> 
</xsl:template> 

<xsl:template match="text()" name="parse-text" priority="1"> 
    <xsl:param name="text" select="." /> 
    <xsl:analyze-string select="$text" regex="\(([^)]*)\)"> 
    <xsl:matching-substring> 
     <source> 
     <xsl:value-of select="regex-group(1)" /> 
     </source> 
    </xsl:matching-substring> 
    <xsl:non-matching-substring> 
     <xsl:value-of select="." /> 
    </xsl:non-matching-substring> 
    </xsl:analyze-string> 
</xsl:template> 

<xsl:template match="hi[@rend='italics']"> 
    <quote> 
    <xsl:apply-templates select="(@* except @rend) | node()" /> 
    <xsl:for-each select="following-sibling::node()[1]/self::text()[matches(.,'^[!?]')]"> 
     <xsl:value-of select="replace(., '^([!?]+).*$', '$1')" /> 
    </xsl:for-each> 
    </quote> 
</xsl:template> 

</xsl:stylesheet> 
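
To make the "multiple (Source) fragments within one text node" point concrete, here is a minimal standalone sketch of the same xsl:analyze-string pattern used in the parse-text template above. The demo wrapper element, the sample variable, and the sample string are made up for illustration and are not part of the question's data.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" />

  <!-- Hypothetical sample text with two bracketed fragments. -->
  <xsl:variable name="sample"
      select="' (First source). More prose (Second source).'" />

  <!-- Runs against any XML input; the input itself is ignored. -->
  <xsl:template match="/">
    <demo>
      <xsl:analyze-string select="$sample" regex="\(([^)]*)\)">
        <!-- Each "(...)" match becomes a source element without brackets. -->
        <xsl:matching-substring>
          <source>
            <xsl:value-of select="regex-group(1)" />
          </source>
        </xsl:matching-substring>
        <!-- Everything between the matches is copied through unchanged. -->
        <xsl:non-matching-substring>
          <xsl:value-of select="." />
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </demo>
  </xsl:template>
</xsl:stylesheet>
```

Because xsl:analyze-string visits every match in turn, both bracketed fragments should come out as separate source elements, with the full stops and the prose between them left in place.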
2

This transformation:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 
    <xsl:output omit-xml-declaration="yes"/> 

<xsl:template match="node()|@*"> 
    <xsl:copy> 
     <xsl:apply-templates select="node()|@*"/> 
    </xsl:copy> 
</xsl:template> 

<xsl:template match= 
    "hi[@rend='italics' 
    and 
     preceding-sibling::node()[1][self::text()[ends-with(., '„')]] 
     ]"> 

    <quote> 
    <xsl:value-of select= 
    "concat(., 
      if(matches(following-sibling::text()[1], '^[?!]+')) 
       then replace(following-sibling::text()[1], '^([?!]+).*$', '$1') 
       else() 
      ) 
     "/> 
    </quote> 
</xsl:template> 

<xsl:template match="text()[true()]"> 
    <xsl:variable name="vThis" select="."/> 
    <xsl:variable name="vThis2" select="translate($vThis, '„」?!', '')"/> 

    <xsl:value-of select="substring-before(concat($vThis2, '('), '(')"/> 
    <xsl:if test="contains($vThis2, '(')"> 
    <source> 
    <xsl:value-of select= 
     "substring-before(substring-after($vThis2, '('), ')')"/> 
    </source> 
    <xsl:value-of select="substring-after($vThis2, ')')"/> 
    </xsl:if> 
</xsl:template> 
</xsl:stylesheet> 

when applied to the provided XML document:

<root> 
     <p> 
      <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a 
       quote</hi>?」 (Source). </p> 

     <p> 
      <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a 
       quote</hi>」 (Source). </p> 

     <p> 
      <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is 
       definitely a quote</hi>!」 (Source). </p> 

     <p> 
      <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a 
       first quote</hi>」 (Source). „<hi rend="italics">Sometimes there is a second quote as 
       well</hi>!?」 (Source). </p> 

</root> 

produces the wanted, correct result:

<root> 
     <p> 
      <element>This one is taken care of.</element> Some more text. <quote>Is this a 
       quote?</quote> <source>Source</source>. </p> 

     <p> 
      <element>This one is taken care of.</element> Some more text. <quote>This is a 
       quote</quote> <source>Source</source>. </p> 

     <p> 
      <element>This one is taken care of.</element> Some more text. <quote>This is 
       definitely a quote!</quote> <source>Source</source>. </p> 

     <p> 
      <element>This one is taken care of.</element> Some more text.<quote>This is a 
       first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as 
       well!?</quote> <source>Source</source>. </p> 

</root> 

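+1. As an alternative in XSLT 2.0, I would be inclined to use 'analyze-string' on the text nodes to turn the '(Source)' text into a 'source' element. I also wonder whether all quote characters and punctuation marks in the text nodes can simply be removed the way you do it, or whether they should be removed only when they occur directly before or after those 'hi' elements. –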


On Saxon, this raises a lot of recoverable errors: ambiguous rule match for text(). –


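@MartinHonnen, yes, 'xsl:analyze-string' is fine, and I would use it if the problem were more complex. As for where the characters should be removed, the question is currently unclear on that - it can easily be done either way. My aim was to come up with a short solution - and I think I did. –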
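
Following up on the points raised in these comments, here is a rough sketch of how the second stylesheet could be varied. It is not taken from either answer. It assumes that the quotation marks and the ?/! characters should be removed only where they sit directly next to a hi element, it uses xsl:analyze-string for the bracketed sources, and it gives the text() rule an explicit priority so that it cannot tie with the identity rule. The variable names (vNext, vAfterHi, vBeforeHi, vHead, vText) are made up. Unlike the second answer, it turns every italic hi into a quote, whether or not a „ precedes it, and it flattens any markup inside hi.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>

  <!-- Identity rule: copy everything not handled below. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Italic runs become quotes; any leading ? or ! of the text node
       directly after the element is pulled inside the quote. -->
  <xsl:template match="hi[@rend='italics']">
    <xsl:variable name="vNext"
        select="following-sibling::node()[1][self::text()]"/>
    <quote>
      <xsl:value-of select="."/>
      <xsl:if test="matches($vNext, '^[?!]+')">
        <xsl:value-of select="replace($vNext, '^([?!]+).*$', '$1', 's')"/>
      </xsl:if>
    </quote>
  </xsl:template>

  <!-- Explicit priority, so this rule always wins over the identity rule. -->
  <xsl:template match="text()" priority="1">
    <xsl:variable name="vAfterHi"
        select="exists(preceding-sibling::node()[1][self::hi[@rend='italics']])"/>
    <xsl:variable name="vBeforeHi"
        select="exists(following-sibling::node()[1][self::hi[@rend='italics']])"/>
    <!-- Drop leading ?/! and the closing mark only right after a hi. -->
    <xsl:variable name="vHead"
        select="if ($vAfterHi) then replace(., '^[?!]*」', '') else string(.)"/>
    <!-- Drop the trailing opening mark only right before a hi. -->
    <xsl:variable name="vText"
        select="if ($vBeforeHi) then replace($vHead, '„$', '') else $vHead"/>
    <!-- Wrap every bracketed fragment in a source element. -->
    <xsl:analyze-string select="$vText" regex="\(([^)]*)\)">
      <xsl:matching-substring>
        <source>
          <xsl:value-of select="regex-group(1)"/>
        </source>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

The trade-off is length: this is noticeably longer than the translate()-based version, but a „, ?, or ! that appears elsewhere in running text is left alone.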