2017-06-15 23 views

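Remove special characters, and search and replace, using XSLT

I am trying to replace certain text strings and then remove all RTF tags from that same text string.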

The initial value is:

<test> 
<data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0  Times New Roman;}{\f2\fcharset0 Segoe UI;}{\f3\fcharset0 arial;}}{\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{{\ltrch Ingredients: roast British chicken breast \'b7 chicken stock mayo and smoked \'b7 prawns with mayo on malted brown bread \'b7 smoked British ham with mustard mayo on oatmeal bread \'b7 .}\li0\ri0\sa0\sb0\fi0\ql\par} 
{{\ltrch }{\ltrch }{\ltrch }\li0\ri0\sa0\sb0\fi0\ql\par} 
{{\ltrch roast British chicken breast \'b7 chicken stock mayo and smoked : Chicken Breast (25.89%) \'b7 }{\ltrch {\b Wheatflour}}{\ltrch contains }{\ltrch {\b Gluten}}{\ltrch (with Wheatflour \'b7 Calcium Carbonate \'b7 Iron \'b7 Niacin \'b7 Thiamin) \'b7 Water \'b7 Pork (10.32%) \'b7 Malted }{\ltrch {\b Wheatflakes}}{\ltrch (contain }{\ltrch {\b Gluten}}{\ltrch) \'b7 Rapeseed Oil \'b7 }{\ltrch {\b Wheat}}\li0\ri0\sa0\sb0\fi0\ql\par} 
{{\ltArch }{\ltrch }{\ltrch }\li0\ri0\sa0\sb0\fi0\ql\par} 

} 
} 
</test> 

What needs to be done:

  1. Values like {\b Wheat} should become <bold>Wheat</bold>, where Wheat can be any text.
  2. \'b7 should become a comma (',')

The result would be:

<test> 
<data>Ingredients: roast British chicken breast , chicken stock mayo and smoked , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , . 
roast British chicken breast , chicken stock mayo and smoked : Chicken Breast (25.89%) , <bold> Wheatflour</bold> contains <bold>Gluten</bold>(with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold> Wheatflakes</bold>contain <bold> Gluten</bold>, Rapeseed Oil , <bold> Wheat</bold> 
</data> 
</test> 

Can this be done? If so, how?

Answer

If you can use XSLT 2.0 or later, which includes regular expression support, this is not very difficult. The key is the replace() function.
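As a minimal sketch of how replace() behaves, the stylesheet below applies one pattern to a literal string. The sample string and the square-bracket replacement are made up for illustration; only the pattern is taken from the question's {\b ...} rule. It assumes an XSLT 2.0 processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- replace(input, pattern, replacement):
         \{ and \} match literal braces, \\ matches one backslash,
         (.*?) lazily captures the run text, and $1 in the replacement
         refers back to that captured group. -->
    <xsl:value-of
        select="replace('salt {\b Wheat} oil', '\{\\b (.*?)\}', '[$1]')"/>
  </xsl:template>

</xsl:stylesheet>
```

Run against any input document, this should print `salt [Wheat] oil`.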

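Here is a code snippet that starts cleaning up your RTF mess. The first block below is the output it currently produces; the second is the XSLT that produces it: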

<?xml version="1.0" encoding="UTF-8"?><test> 
    <data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{Ingredients: roast British chicken breast , chicken stock mayo and smoked , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .\li0\ri0\sa0\sb0\fi0\ql\par} 
     { \li0\ri0\sa0\sb0\fi0\ql\par} 
     {roast British chicken breast , chicken stock mayo and smoked : Chicken Breast (25.89%) , <bold>Wheatflour</bold> contains <bold>Gluten</bold> (with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold>Wheatflakes</bold> (contain <bold>Gluten</bold>) , Rapeseed Oil , <bold>Wheat</bold>\li0\ri0\sa0\sb0\fi0\ql\par} 
     {{\ltArch } \li0\ri0\sa0\sb0\fi0\ql\par} 

     } 
     }</data> 
</test> 

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity template (an assumption; the original snippet shows only
      the data template): copies everything else, such as <test>, unchanged. -->
    <xsl:template match="@*|node()">
     <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
     </xsl:copy>
    </xsl:template>

    <xsl:template match="data">
     <xsl:copy>
      <!-- Note: XSL variables are _immutable_: once created, their values
       cannot be changed. I use a chain of variables here simply for
       purposes of illustration, as a means of showing each regex
       replacement operation on its own. These could all be stacked
       into a single statement, but that is somewhat harder for
       humans to read. :) -->
      <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
      <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
      <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
      <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
      <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>
      <!-- The bold tags are plain text here, so escaping is disabled on output. -->
      <xsl:value-of select="$no-ltrch" disable-output-escaping="yes"/>
     </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

That first block is the current output, produced after adding the closing </data> tag that is missing from your sample input XML.
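The snippet writes the bold tags as text and relies on disable-output-escaping, which only takes effect when the processor serializes the result itself. As a possible alternative, here is a sketch that builds real bold elements with xsl:analyze-string. It covers only the two rules from the question (bold runs and \'b7) and leaves the other RTF codes in place; the identity template is an assumption added so the stylesheet is complete.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed identity template: copies <test> and anything else as is. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <xsl:copy>
      <!-- The regex attribute is an attribute value template, so each
           literal brace is doubled: \{{ is the regex \{ and \}} is \}. -->
      <xsl:analyze-string select="." regex="\{{\\b (.*?)\}}">
        <xsl:matching-substring>
          <!-- A real element node, so no output escaping tricks are needed. -->
          <bold><xsl:value-of select="regex-group(1)"/></bold>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Text between bold runs: turn \'b7 into a comma. -->
          <xsl:value-of select="replace(., '\\''b7', ',')"/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off is that later cleanup steps have to run on the text pieces inside xsl:non-matching-substring rather than on one long string.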

At this point, you just need to work out the remaining regular expressions needed to strip the rest of the RTF codes.
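One way the remaining cleanup might look is sketched below. The last two patterns are assumptions, not part of the original answer: the first removes any control word (a backslash, letters, an optional number, and one optional trailing space), the second removes leftover group braces. Because every control word is removed, a separate \ltrch step is no longer needed. The sketch does not handle escaped characters such as \\, \{ and \}, or hex escapes other than \'b7, and normalize-space() collapses the paragraph line breaks into single spaces.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed identity template: copies <test> and anything else as is. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <xsl:copy>
      <!-- Steps from the answer: bold runs, bullets, font and colour tables.
           Bold must be handled first, while \b is still present. -->
      <xsl:variable name="bolded"
          select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
      <xsl:variable name="commas"
          select="replace($bolded, '\\''b7', ',')"/>
      <xsl:variable name="unfonted"
          select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
      <xsl:variable name="uncolored"
          select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>

      <!-- Assumed extra step: drop every remaining control word,
           for example \rtf1, \ansi, \li0 or \par. -->
      <xsl:variable name="no-controls"
          select="replace($uncolored, '\\[a-zA-Z]+(-?[0-9]+)? ?', '')"/>
      <!-- Assumed extra step: drop the group braces that are left over. -->
      <xsl:variable name="no-braces"
          select="replace($no-controls, '[\{\}]', '')"/>

      <!-- Trim and collapse whitespace, then emit the bold tags as markup. -->
      <xsl:value-of select="normalize-space($no-braces)"
          disable-output-escaping="yes"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

If the line breaks between paragraphs matter, one option is to replace \par with a newline before the control-word step and skip normalize-space().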