2017-10-11 89 views
0

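How can I use XQuery to convert consecutive tags into nested tags or a table?

I have an XML file with consecutive (flat) tags rather than nested tags, as shown below: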

<title> 
    <subtitle> 
     <topic att="TopicTitle">Topic title 1</topic> 
     <content att="TopicSubtitle">topic subtitle 1</content> 
     <content att="Paragraph">paragraph text 1</content> 
     <content att="Paragraph">paragraph text 2</content> 
     <content att="TopicSubtitle">topic subtitle 2</content> 
     <content att="Paragraph">paragraph text 1</content> 
     <content att="Paragraph">paragraph text 2</content> 

     <topic att="TopicTitle">Topic title 2</topic> 
     <content att="TopicSubtitle">topic subtitle 1</content> 
     <content att="Paragraph">paragraph text 1</content> 
     <content att="Paragraph">paragraph text 2</content> 
     <content att="TopicSubtitle">topic subtitle 2</content> 
     <content att="Paragraph">paragraph text 1</content> 
     <content att="Paragraph">paragraph text 2</content> 
    </subtitle> 
</title> 

I am using XQuery in BaseX, and I want to convert it into a table with the following columns:

Title       Subtitle    TopicTitle     TopicSubtitle     Paragraph
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 1  paragraph text 1
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 1  paragraph text 2
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 2  paragraph text 1
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 2  paragraph text 2
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 1  paragraph text 1
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 1  paragraph text 2
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 2  paragraph text 1
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 2  paragraph text 2

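I am new to XQuery and XPath, but I already understand the basics of navigating through nodes and selecting the ones I need. What I don't know yet is how to handle consecutive data that I want to convert to nested XML or to a table (CSV?). Can anyone help?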

Answers

5

You can, for example, use a tumbling window (https://www.w3.org/TR/xquery-30/#id-windows) to convert the flat XML to nested XML:

for tumbling window $w in title/subtitle/* 
    start $t when $t instance of element(topic) 
return 
    <topic 
     title="{$t/@att}"> 
     { 
      for tumbling window $content in tail($w) 
       start $c when $c/@att = 'TopicSubtitle' 
      return 
       <subtopic 
        title="{$c/@att}"> 
        { 
         tail($content) ! <para>{node()}</para> 
        } 
       </subtopic> 
     } 
    </topic> 

For the input above, this gives:

<topic title="TopicTitle"> 
    <subtopic title="TopicSubtitle"> 
     <para>paragraph text 1</para> 
     <para>paragraph text 2</para> 
    </subtopic> 
    <subtopic title="TopicSubtitle"> 
     <para>paragraph text 1</para> 
     <para>paragraph text 2</para> 
    </subtopic> 
</topic><topic title="TopicTitle"> 
    <subtopic title="TopicSubtitle"> 
     <para>paragraph text 1</para> 
     <para>paragraph text 2</para> 
    </subtopic> 
    <subtopic title="TopicSubtitle"> 
     <para>paragraph text 1</para> 
     <para>paragraph text 2</para> 
    </subtopic> 
</topic> 
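
To see what the `start` condition does in isolation, here is a minimal sketch on a made-up sequence of strings (the values are hypothetical, not taken from the question). Each window begins at an item that satisfies the condition and, because no `end` condition is given, runs up to the item just before the next match. `$s` is bound to the first item of the window, and `tail($w)` is everything after it.

```xquery
(: Hypothetical input: 'T1' and 'T2' play the role of the topic elements. :)
for tumbling window $w in ('T1', 'a', 'b', 'T2', 'c')
    start $s when starts-with($s, 'T')
return
    (: $s is the item that opened the window; tail($w) drops it. :)
    <group head="{$s}">{
        tail($w) ! <item>{.}</item>
    }</group>
```

This should return two `group` elements: one with head `T1` containing the items `a` and `b`, and one with head `T2` containing the item `c`. Any items that come before the first match would not be part of any window.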

I think you can then transform the whole thing into semicolon-separated data with:

string-join(
<title> 
    <subtitle> 
     { 
      for tumbling window $w in title/subtitle/* 
       start $t when $t instance of element(topic) 
      return 
       <topic 
        title="{$t/@att}" 
        value="{$t}"> 
        { 
         for tumbling window $content in tail($w) 
          start $c when $c/@att = 'TopicSubtitle' 
         return 
          <subtopic 
           title="{$c/@att}" 
           value="{$c}"> 
           { 
            tail($content) ! <para>{node()}</para> 
           } 
          </subtopic> 
        } 
       </topic> 
     } 
    </subtitle> 
</title>//para ! string-join(ancestor-or-self::* ! (text(), @value, 'Irrelevant')[1], ';'), '&#10;') 
+0

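This is great. Exactly what I needed. Even after reading more about tumbling windows, I doubt I would have found it on my own. It took a while to adapt it to my file, but it is now working with several nested tumbling windows. Since it looks a bit dirty, I'd like to ask: do you know of a better way to do this? I mean, would Java, Python, or another language be better suited to this kind of task? Thanks for your help! – ChuyTM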

+0

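For someone who mainly does XSLT (where you could use nested 'xsl:for-each-group group-starting-with'), using XQuery already feels "dirty", but I think both languages are a good choice for processing XML. If you are looking for a cleaner structure for transforming XML to CSV with XQuery, have a look at https://github.com/CliffordAnderson/XQuery4Humanists/blob/master/05-Generating-JSON-and-CSV.md. As for Python, I don't know it well, and even if I did, I think it would depend on which modules you can install. –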

+0

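With pure Java and its built-in XML classes, I think it would take a lot of code. I don't know the Java 8 streams/grouping features well enough to estimate how much code it would need. –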

1

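Although positional grouping is the most general approach to this kind of problem (as Martin Honnen describes: tumbling windows in XQuery 3.0+, for-each-group/@group-starting-with in XSLT 2.0+), I don't think it is necessary here, because you are not actually trying to make use of the hierarchy that is implicit in the data.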

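Specifically, to convert one flat structure with an implicit hierarchy into another flat structure with an implicit hierarchy, you can do something along these lines: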

<table>{ 
    for $para in title/subtitle/content[@att='Paragraph'] 
    return <row> 
     <cell>irrelevant</cell> 
     <cell>irrelevant</cell> 
     <cell>{$para/preceding-sibling::topic[1]/string()}</cell> 
     <cell>{$para/preceding-sibling::content[@att='TopicSubtitle'][1]/string()}</cell> 
     <cell>{$para/string()}</cell> 
    </row> 
}</table>
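
If plain text is wanted instead of a table element, the same sibling lookups can feed `string-join` directly. The following is a minimal sketch, assuming a trimmed copy of the question's input bound to a hypothetical variable `$doc`; the semicolon separator and the literal 'Irrelevant' values are illustrative choices. Note that the attribute test uses 'Paragraph' with a capital P, matching the input, because the comparison is case-sensitive.

```xquery
(: Trimmed copy of the question's input, wrapped in a document node so the
   path title/subtitle/... applies. $doc is a name chosen for this sketch. :)
let $doc := document {
  <title>
    <subtitle>
      <topic att="TopicTitle">Topic title 1</topic>
      <content att="TopicSubtitle">topic subtitle 1</content>
      <content att="Paragraph">paragraph text 1</content>
      <content att="Paragraph">paragraph text 2</content>
      <topic att="TopicTitle">Topic title 2</topic>
      <content att="TopicSubtitle">topic subtitle 1</content>
      <content att="Paragraph">paragraph text 1</content>
    </subtitle>
  </title>
}
return string-join(
  for $para in $doc/title/subtitle/content[@att = 'Paragraph']
  return string-join(
    (
      'Irrelevant',
      'Irrelevant',
      (: string() yields '' for an empty sequence, so every line keeps five fields :)
      string($para/preceding-sibling::topic[1]),
      string($para/preceding-sibling::content[@att = 'TopicSubtitle'][1]),
      string($para)
    ),
    ';'
  ),
  '&#10;'
)
```

Each paragraph becomes one line of five semicolon-separated fields. One caveat of the sibling lookup: a paragraph that follows a new topic without a subtitle of its own would pick up the subtitle of the previous topic, so this relies on the input being as regular as the sample.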