2017-02-18 220 views

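I want to prepare an XSD from this XML and then process the rows to insert the data into a database. To prepare the XSD, I use XSLT to transform the structure into the required format. How can I remove an XML node if its value contains a URL?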

<linked-hash-map> 
    <entry> 
    <string>_type</string> 
    <string>News</string> 
    </entry> 
    <entry> 
    <string>value</string> 
    <list> 
     <linked-hash-map> 
     <entry> 
      <string>name</string> 
      <string> 
      Virat Kohli 
      </string> 
     </entry> 
     <entry> 
      <string>url</string> 
      <string> 
      http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&CID=09E4F1057ADB64720330FB2E7BC96547&rd=1&h=nw8K4uNRgs-nvsuz2GyXpqMxdRmzWK8Xbm3W_1IlO24&v=1&r=http%3a%2f%2fmovies.ndtv.com%2fbollywood%2fvirat-kohli-hearts-anushka-sharma-a-timeline-of-their-romance-1659877&p=DevEx,5026.1 
      </string> 
     </entry> 
     <entry> 
      <string>image</string> 
      <linked-hash-map> 
      <entry> 
       <string>thumbnail</string> 
       <linked-hash-map> 
       <entry> 
        <string>contentUrl</string> 
        <string> 
        https://www.bing.com/th?id=ON.EE674002EC235BD5795D34695EABF504&pid=News 
        </string> 
       </entry> 
       <entry> 
        <string>width</string> 
        <int>640</int> 
       </entry> 
       </linked-hash-map> 
      </entry> 
      </linked-hash-map> 
     </entry> 
     <entry> 
      <string>description</string> 
      <string> 
      On Wednesday, cricketer Virat Kohli 
      </string> 
     </entry> 
     <entry> 
      <string>datePublished</string> 
      <string>2017-02-16T05:39:00</string> 
     </entry> 
     <entry> 
      <string>category</string> 
      <string>Entertainment</string> 
     </entry> 
     </linked-hash-map> 
     <linked-hash-map> 
     <entry> 
      <string>name</string> 
      <string> 
      Shah Rukh Khan’s TV show 
      </string> 
     </entry> 
     <entry> 
      <string>url</string> 
      <string> 
      http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&CID=09E4F1057ADB64720330FB2E7BC96547&rd=1&h=4CnQhOg9Nm7pmIu9OvDl6x9WtYtSuXblCSR_WQz1VoA&v=1&r=http%3a%2f%2fwww.hindustantimes.com%2ftv%2fshah-rukh-khan-s-tv-show-circus-is-back-on-small-screen%2fstory-OjQUQIWi6ogxj5eF1hivTI.html&p=DevEx,5040.1 
      </string> 
     </entry> 
     <entry> 
      <string>image</string> 
      <linked-hash-map> 
      <entry> 
       <string>thumbnail</string> 
       <linked-hash-map> 
       <entry> 
        <string>contentUrl</string> 
        <string> 
        https://www.bing.com/th?id=ON.2974262BB8317FA4D4BCE4A61CA9488E&pid=News 
        </string> 
       </entry> 
       <entry> 
        <string>width</string> 
        <int>700</int> 
       </entry> 
       </linked-hash-map> 
      </entry> 
      </linked-hash-map> 
     </entry> 
     <entry> 
      <string>description</string> 
      <string> 
      Here’s some wonderful news 
      </string> 
     </entry> 
     <entry> 
      <string>datePublished</string> 
      <string>2017-02-16T05:36:00</string> 
     </entry> 
     <entry> 
      <string>category</string> 
      <string>Entertainment</string> 
     </entry> 
     </linked-hash-map> 
    </list> 
    </entry> 
</linked-hash-map> 

The URLs here contain query strings. How can I remove the URLs, or alternatively encode the URLs together with their query strings?

Desired output:

<?xml version="1.0" encoding="utf-8"?> 
<linked-hash-map> 
    <entry> 
    <linked-hash-map> 
     <_type>News</_type> 
     <datarow> 
     <name> Virat Kohli</name> 
     <url>http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&CID=09E4F1057ADB64720330FB2E7BC96547&rd=1&h=nw8K4uNRgs-nvsuz2GyXpqMxdRmzWK8Xbm3W_1IlO24&v=1&r=http%3a%2f%2fmovies.ndtv.com%2fbollywood%2fvirat-kohli-hearts-anushka-sharma-a-timeline-of-their-romance-1659877&p=DevEx,5026.1</url> 
     <contentUrl> https://www.bing.com/th?id=ON.EE674002EC235BD5795D34695EABF504&pid=News </contentUrl> 
     <width>640</width> 
     <description> On Wednesday, cricketer Virat Kohli</description> 
     <readLink> https://api.cognitive.microsoft.com/api/v5/entities/b8ef6b82-02be-1e24-584c-f8283b7bdaeb </readLink> 
     <datePublished>2017-02-16T05:39:00</datePublished> 
     <category>Entertainment</category>  
     </datarow> 
     <datarow> 
     <name> Shah Rukh Khan’s TV show</name> 
     <url> http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&CID=09E4F1057ADB64720330FB2E7BC96547&rd=1&h=4CnQhOg9Nm7pmIu9OvDl6x9WtYtSuXblCSR_WQz1VoA&v=1&r=http%3a%2f%2fwww.hindustantimes.com%2ftv%2fshah-rukh-khan-s-tv-show-circus-is-back-on-small-screen%2fstory-OjQUQIWi6ogxj5eF1hivTI.html&p=DevEx,5040.1 </url> 
     <contentUrl> https://www.bing.com/th?id=ON.EE674002EC235BD5795D34695EABF504&pid=News </contentUrl> 
     <width>640</width> 
     <description> Here’s some wonderful news </description> 
     <readLink> https://api.cognitive.microsoft.com/api/v5/entities/b8ef6b82-02be-1e24-584c-f8283b7bdaeb </readLink> 
     <datePublished>2017-02-16T05:39:00</datePublished> 
     <category>Entertainment</category> 
     </datarow> 
    </linked-hash-map> 
    </entry> 
</linked-hash-map> 

Below is the transformation script I use for this structure.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> 
    <xsl:strip-space elements="*"/> 

    <xsl:template match="node()|@*"> 
    <xsl:copy> 
     <xsl:apply-templates select="node()|@*"/> 
    </xsl:copy> 
    </xsl:template> 

    <xsl:template match="/linked-hash-map"> 
    <xsl:element name="{local-name()}"> 
     <xsl:for-each select="entry"> 
     <xsl:choose> 
      <xsl:when test="list/linked-hash-map"> 
      <xsl:for-each select="list/linked-hash-map"> 
       <datarow> 
       <xsl:for-each select="entry"> 
        <xsl:if test="not(node()[1]='image' or node()[1]='about' or node()[1]='clusteredArticles' or node()[1]='mentions' or node()[1]='provider' or node()[1]='url' or node()[1]='description' or node()[1]='name')"> 
        <xsl:text disable-output-escaping="yes">&lt;</xsl:text> 
        <xsl:value-of select="*[1]"/> 
        <xsl:text disable-output-escaping="yes">&gt;</xsl:text> 
        <xsl:value-of select="*[2]"/> 
        <xsl:text disable-output-escaping="yes">&lt;/</xsl:text> 
        <xsl:value-of select="*[1]"/> 
        <xsl:text disable-output-escaping="yes">&gt;</xsl:text> 
        </xsl:if> 
       </xsl:for-each> 
       </datarow> 
      </xsl:for-each> 
      </xsl:when> 
      <xsl:otherwise> 
      <xsl:text disable-output-escaping="yes">&lt;</xsl:text> 
      <xsl:value-of select="*[1]"/> 
      <xsl:text disable-output-escaping="yes">&gt;</xsl:text> 
      <xsl:value-of select="*[2]"/> 
      <xsl:text disable-output-escaping="yes">&lt;/</xsl:text> 
      <xsl:value-of select="*[1]"/> 
      <xsl:text disable-output-escaping="yes">&gt;</xsl:text> 
      </xsl:otherwise> 
     </xsl:choose> 
     </xsl:for-each> 
    </xsl:element> 
    </xsl:template> 
    <xsl:template match="/"> 
    <xsl:copy> 
     <linked-hash-map> 
     <entry> 
      <xsl:apply-templates/> 
     </entry> 
     </linked-hash-map> 
    </xsl:copy> 
    </xsl:template> 

</xsl:stylesheet> 

Where is your attempted script? What errors or undesired results do you get? – Parfait

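Whatever script I run, it fails at the very first step. To keep moving, I have for now manipulated the ampersands in Java code and replaced them with blanks. I have updated the post. Please see above. – user3187932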

Answer

As posted, the original XML is not well-formed: every ampersand used in the URLs must be replaced with the corresponding XML entity reference, namely &amp;.
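As a stop-gap while the producer of the XML is being fixed, the stray ampersands can be escaped before parsing. The following is only a minimal Python sketch under stated assumptions: the helper name `escape_bare_ampersands` and the shortened URL are hypothetical, and the pattern treats anything shaped like `&name;` as an existing reference, so a query string that happens to contain such a sequence would be left untouched.

```python
import re
import xml.etree.ElementTree as ET

# Matches "&" that does not already start an entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace every stray "&" with "&amp;", leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


# Hypothetical, shortened stand-in for one of the URL nodes in the question.
raw = "<string>http://www.bing.com/th?id=ON.1&pid=News</string>"

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False  # the bare "&" makes the parser reject the document

fixed = escape_bare_ampersands(raw)
url = ET.fromstring(fixed).text  # the parser hands back the literal "&" again
```

Note that escaping does not alter the URL: once parsed, the text node again contains a plain `&`, so nothing has to be removed or replaced with blanks.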

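Please double-check how the original XML is produced, because it should not be generated as a text file by concatenating strings (one way such markup gets built). Unfortunately, this is a very common practice in general-purpose programming. XML documents should be built with W3C-compliant DOM libraries (e.g., Java's javax.xml, Python's xml.etree, PHP's DOMDocument, .NET's XmlDocument) and their createElement, appendChild, setAttribute or equivalent methods.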
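To illustrate why a DOM library avoids the problem, here is a minimal Python sketch using the standard library's `xml.dom.minidom`, which exposes the `createElement` and `appendChild` methods. The helper name `make_entry` is hypothetical, and the element layout simply imitates the `entry`/`string` pairs from the question. The point is that the serializer escapes the ampersand on its own.

```python
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("linked-hash-map")
doc.appendChild(root)


def make_entry(key, value):
    """Build <entry><string>key</string><string>value</string></entry>."""
    entry = doc.createElement("entry")
    for text in (key, value):
        node = doc.createElement("string")
        node.appendChild(doc.createTextNode(text))  # raw text, no manual escaping
        entry.appendChild(node)
    return entry


url = "https://www.bing.com/th?id=ON.EE674002EC235BD5795D34695EABF504&pid=News"
root.appendChild(make_entry("contentUrl", url))

xml_text = doc.toxml()  # "&" is written out as "&amp;" by the serializer

# Round trip: parsing the result yields the original URL unchanged.
round_trip = minidom.parseString(xml_text).getElementsByTagName("string")[1].firstChild.data
```

The same approach applies in Java with `javax.xml.parsers.DocumentBuilder` and `org.w3c.dom.Document`.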

Once the XML is valid, consider the more generalized XSLT below.

Input (with character entities adjusted)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<linked-hash-map> 
    <entry> 
    <string>_type</string> 
    <string>News</string> 
    </entry> 
    <entry> 
    <string>value</string> 
    <list> 
     <linked-hash-map> 
     <entry> 
      <string>name</string> 
      <string> 
      Virat Kohli 
      </string> 
     </entry> 
     <entry> 
      <string>url</string> 
      <string> 
      http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&amp;CID=09E4F1057ADB64720330FB2E7BC96547&amp;rd=1&amp;h=nw8K4uNRgs-nvsuz2GyXpqMxdRmzWK8Xbm3W_1IlO24&amp;v=1&amp;r=http%3a%2f%2fmovies.ndtv.com%2fbollywood%2fvirat-kohli-hearts-anushka-sharma-a-timeline-of-their-romance-1659877&amp;p=DevEx,5026.1 
      </string> 
     </entry> 
     <entry> 
      <string>image</string> 
      <linked-hash-map> 
      <entry> 
       <string>thumbnail</string> 
       <linked-hash-map> 
       <entry> 
        <string>contentUrl</string> 
        <string> 
        https://www.bing.com/th?id=ON.EE674002EC235BD5795D34695EABF504&amp;pid=News 
        </string> 
       </entry> 
       <entry> 
        <string>width</string> 
        <int>640</int> 
       </entry> 
       </linked-hash-map> 
      </entry> 
      </linked-hash-map> 
     </entry> 
     <entry> 
      <string>description</string> 
      <string> 
      On Wednesday, cricketer Virat Kohli 
      </string> 
     </entry> 
     <entry> 
      <string>datePublished</string> 
      <string>2017-02-16T05:39:00</string> 
     </entry> 
     <entry> 
      <string>category</string> 
      <string>Entertainment</string> 
     </entry> 
     </linked-hash-map> 
     <linked-hash-map> 
     <entry> 
      <string>name</string> 
      <string> 
      Shah Rukh Khan's TV show 
      </string> 
     </entry> 
     <entry> 
      <string>url</string> 
      <string> 
      http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&amp;CID=09E4F1057ADB64720330FB2E7BC96547&amp;rd=1&amp;h=4CnQhOg9Nm7pmIu9OvDl6x9WtYtSuXblCSR_WQz1VoA&amp;v=1&amp;r=http%3a%2f%2fwww.hindustantimes.com%2ftv%2fshah-rukh-khan-s-tv-show-circus-is-back-on-small-screen%2fstory-OjQUQIWi6ogxj5eF1hivTI.html&amp;p=DevEx,5040.1 
      </string> 
     </entry> 
     <entry> 
      <string>image</string> 
      <linked-hash-map> 
      <entry> 
       <string>thumbnail</string> 
       <linked-hash-map> 
       <entry> 
        <string>contentUrl</string> 
        <string> 
        https://www.bing.com/th?id=ON.2974262BB8317FA4D4BCE4A61CA9488E&amp;pid=News 
        </string> 
       </entry> 
       <entry> 
        <string>width</string> 
        <int>700</int> 
       </entry> 
       </linked-hash-map> 
      </entry> 
      </linked-hash-map> 
     </entry> 
     <entry> 
      <string>description</string> 
      <string> 
      Here's some wonderful news 
      </string> 
     </entry> 
     <entry> 
      <string>datePublished</string> 
      <string>2017-02-16T05:36:00</string> 
     </entry> 
     <entry> 
      <string>category</string> 
      <string>Entertainment</string> 
     </entry> 
     </linked-hash-map> 
    </list> 
    </entry> 
</linked-hash-map> 

XSLT (see inline comments)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> 
    <xsl:strip-space elements="*"/> 

    <!-- APPLY ONLY SECOND ENTRY OFF ROOT --> 
    <xsl:template match="/linked-hash-map"> 
    <xsl:copy>  
     <xsl:apply-templates select="entry[2]"/>  
    </xsl:copy> 
    </xsl:template> 

    <xsl:template match="entry[2]"> 
    <xsl:copy> 
     <!-- RETRIEVE FIRST ENTRY CONTENT --> 
     <xsl:element name="{preceding-sibling::entry/string[1]}"> 
     <xsl:value-of select="preceding-sibling::entry/string[2]"/> 
     </xsl:element> 
     <!-- APPLY GRANDCHILD LINKED HASH MAP --> 
     <linked-hash-map><xsl:apply-templates select="list/linked-hash-map"/></linked-hash-map> 
    </xsl:copy> 
    </xsl:template> 

    <!-- GENERALIZE FOR ALL DESCENDANT ENTRY NODES (W/O LINKED HASH MAP CHILD) --> 
    <xsl:template match="linked-hash-map">  
    <datarow> 
     <xsl:for-each select="descendant::entry[local-name(*[2])!='linked-hash-map']">   
      <xsl:element name="{string[1]}"> 
      <xsl:value-of select="normalize-space(string[2]|int)"/> 
      </xsl:element> 
     </xsl:for-each> 
     <!-- ADDED NODE (NOT PART OF ORIGINAL) --> 
     <readLink>https://api.cognitive.microsoft.com/api/v5/entities/b8ef6b82-02be-1e24-584c-f8283b7bdaeb</readLink> 
    </datarow>  
    </xsl:template> 

</xsl:stylesheet> 

Output

<?xml version="1.0" encoding="UTF-8"?> 
<linked-hash-map> 
    <entry> 
     <_type>News</_type> 
     <linked-hash-map> 
     <datarow> 
      <name>Virat Kohli</name> 
      <url>http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&amp;CID=09E4F1057ADB64720330FB2E7BC96547&amp;rd=1&amp;h=nw8K4uNRgs-nvsuz2GyXpqMxdRmzWK8Xbm3W_1IlO24&amp;v=1&amp;r=http%3a%2f%2fmovies.ndtv.com%2fbollywood%2fvirat-kohli-hearts-anushka-sharma-a-timeline-of-their-romance-1659877&amp;p=DevEx,5026.1</url> 
      <contentUrl>https://www.bing.com/th?id=ON.EE674002EC235BD5795D34695EABF504&amp;pid=News</contentUrl> 
      <width>640</width> 
      <description>On Wednesday, cricketer Virat Kohli</description> 
      <datePublished>2017-02-16T05:39:00</datePublished> 
      <category>Entertainment</category> 
      <readLink>https://api.cognitive.microsoft.com/api/v5/entities/b8ef6b82-02be-1e24-584c-f8283b7bdaeb</readLink> 
     </datarow> 
     <datarow> 
      <name>Shah Rukh Khan's TV show</name> 
      <url>http://www.bing.com/cr?IG=3DA864FA197A4D5DAD062780C15E3A16&amp;CID=09E4F1057ADB64720330FB2E7BC96547&amp;rd=1&amp;h=4CnQhOg9Nm7pmIu9OvDl6x9WtYtSuXblCSR_WQz1VoA&amp;v=1&amp;r=http%3a%2f%2fwww.hindustantimes.com%2ftv%2fshah-rukh-khan-s-tv-show-circus-is-back-on-small-screen%2fstory-OjQUQIWi6ogxj5eF1hivTI.html&amp;p=DevEx,5040.1</url> 
      <contentUrl>https://www.bing.com/th?id=ON.2974262BB8317FA4D4BCE4A61CA9488E&amp;pid=News</contentUrl> 
      <width>700</width> 
      <description>Here's some wonderful news</description> 
      <datePublished>2017-02-16T05:36:00</datePublished> 
      <category>Entertainment</category> 
      <readLink>https://api.cognitive.microsoft.com/api/v5/entities/b8ef6b82-02be-1e24-584c-f8283b7bdaeb</readLink> 
     </datarow> 
     </linked-hash-map> 
    </entry> 
</linked-hash-map> 
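
For readers who find the last template hard to follow, its flattening logic can be sketched in Python. This is an illustration under stated assumptions, not a replacement for the stylesheet: `SAMPLE` is a hypothetical, heavily shortened version of the input, `flatten` is a hypothetical helper name, and the sketch covers only the `datarow` step (it does not emit `_type` or the added `readLink` node).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<linked-hash-map>
  <entry><string>_type</string><string>News</string></entry>
  <entry>
    <string>value</string>
    <list>
      <linked-hash-map>
        <entry><string>name</string><string>
          Virat Kohli
        </string></entry>
        <entry><string>image</string><linked-hash-map>
          <entry><string>thumbnail</string><linked-hash-map>
            <entry><string>width</string><int>640</int></entry>
          </linked-hash-map></entry>
        </linked-hash-map></entry>
        <entry><string>category</string><string>Entertainment</string></entry>
      </linked-hash-map>
    </list>
  </entry>
</linked-hash-map>"""


def flatten(record):
    """Turn one record <linked-hash-map> into a flat <datarow>."""
    row = ET.Element("datarow")
    for entry in record.iter("entry"):      # all descendant entries, document order
        key, value = entry[0], entry[1]
        if value.tag == "linked-hash-map":  # container entry: only its leaves are kept
            continue
        # " ".join(split()) trims and collapses whitespace like normalize-space()
        ET.SubElement(row, key.text.strip()).text = " ".join((value.text or "").split())
    return row


root = ET.fromstring(SAMPLE)
rows = [flatten(record) for record in root.findall("entry/list/linked-hash-map")]
result = ET.tostring(rows[0], encoding="unicode")
```

As in the XSLT, entries whose second child is itself a `linked-hash-map` (here `image` and `thumbnail`) contribute no element of their own; only their leaf entries end up in the row.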

Hi Parfait, this works. However, when I read the XML file in Java I run into a problem: it gives me the error "Invalid byte 1 of 1-byte UTF-8 sequence". – user3187932

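Which XML file? The original or the transformed one? I ran the XSLT with nothing but Java 1.8 (using both its built-in Apache Xalan and the external Saxon HE) without any problems. Your source may differ from my input. As I mentioned, watch out for the entity issue. – Parfait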

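Hi Parfait, I have made sure the XML entity references are handled properly. But characters like the following show up in my name and description fields: banayai ha “dear” friend – user3187932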
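
The error quoted above usually means that the parser is reading the file as UTF-8 while the bytes were written in a different encoding; curly quotes such as ’ or “ saved as Windows-1252 are a typical trigger. The thread does not confirm which encoding the file actually uses, so the following Python sketch is only an assumption-laden illustration: `to_utf8` is a hypothetical helper, and `cp1252` is a guess at the source encoding.

```python
def to_utf8(data, assumed_encoding="cp1252"):
    """Return UTF-8 bytes; re-encode from assumed_encoding if data is not valid UTF-8."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        return data.decode(assumed_encoding).encode("utf-8")


# Hypothetical file content: the curly apostrophe becomes the single byte 0x92
# in Windows-1252, which is not a valid start byte in UTF-8.
legacy = "<name>Shah Rukh Khan\u2019s TV show</name>".encode("cp1252")
fixed = to_utf8(legacy)
```

In Java, the equivalent fix is to read the file with the encoding it was really written in, or to make the program that writes the file emit UTF-8 so that it matches the XML declaration.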