2014-02-07 66 views

Parsing an HTML page with subheadings using XQuery

I have an HTML page with the following structure:

<div id="content"> 
    <h2><span class="heading">Section A</span></h2> 
    <p>Content of the section</p> 
    <p>More content in the same section</p> 
    <div>We can also have divs</div> 
    <ul><li>And</li><li>Lists</li><li>Too</li></ul> 
    <h3><span class="heading">Sub-section heading</span></h3> 
    <p>The content here can be a mixture of divs, ps, lists, etc too</p> 
    <h2><span class="heading">Section B</span></h2> 
    <p>This is section B's content</p> 
    and so on 
</div> 

I want to create the following XML structure:

<sections> 
    <section> 
     <heading>Section A</heading> 
     <content> 
      <p>Content of the section</p> 
      <p>More content in the same section</p> 
      <div>We can also have divs</div> 
      <ul><li>And</li><li>Lists</li><li>Too</li></ul> 
     </content> 
     <sub-sections> 
      <section> 
       <heading>Section B</heading> 
       <content> 
        <p>This is section B's content</p> 
       </content> 
      </section> 
     </sub-sections> 
    </section> 
</sections> 

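The difficulty I'm having is creating the <sub-sections> tag. This is what I have so far, but Section B appears inside Section A's <content> node. I also get a <section> node for Section B, but it has no content.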

let $content := //div[@id="content"] 
let $headings := $content/(h2|h3|h4|h5|h6)[span[@class="heading"]] 
return 
    <sections> 
    { 
    for $heading in $headings 
    return 
     <section> 
     <heading>{$heading/span/text()}</heading> 
     <content> 
     { 
      for $paragraph in $heading/following-sibling::*[preceding-sibling::h2[1] = $heading] 
      return 
      $paragraph 
     } 
     </content> 
     </section> 
    } 
    </sections> 

Thanks in advance for any help or pointers.

Answers


I would start by isolating each section's data into a variable and then carry on processing from there:

let $content := //div[@id="content"] 
return 
    <sections> 
    { 
    for $heading in $content//h2[span[@class='heading'] ] 
    let $nextHeading := $heading/following-sibling::h2 
    let $sectionContent := $heading/following-sibling::* except ($nextHeading, $nextHeading/following-sibling::*) 
    return 
     <section> 
     {$sectionContent} 
     </section> 
    } 
    </sections> 

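Here I've only handled the sections. You can process the sub-sections by doing something similar again on the $sectionContent variable. For now, though, you have to do something slightly odd to select the first part of the section, the content before the first h3 (swap in something similar for the other part):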

$sectionContent except ($sectionContent[self::h3], $sectionContent[self::h3]/following-sibling::*) 
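Putting both levels together, a minimal XQuery 1.0 sketch (no window clauses) might look like the following. It is an illustration, not the answerer's code: the helper `local:until-next` is a hypothetical name, the sketch assumes the HTML elements are in no namespace, and it handles only h2 sections with h3 sub-sections. It uses the node-order operator `<<` instead of `except` to cut each run of siblings off at the next heading.

```xquery
(: Hypothetical helper: the siblings after $heading, up to but not
   including the next sibling whose element name is in $stop-names. :)
declare function local:until-next(
  $heading as element(),
  $stop-names as xs:string*
) as element()* {
  let $next := $heading/following-sibling::*[local-name() = $stop-names][1]
  return
    if (empty($next))
    then $heading/following-sibling::*
    else $heading/following-sibling::*[. << $next]
};

let $content := //div[@id = "content"]
return
  <sections>{
    for $h2 in $content/h2[span[@class = "heading"]]
    (: everything belonging to this section, sub-sections included :)
    let $body := local:until-next($h2, "h2")
    let $h3s := $body[self::h3]
    return
      <section>
        <heading>{ string($h2/span) }</heading>
        <content>{
          (: only the part before the first h3, if there is one :)
          if (empty($h3s)) then $body else $body[. << $h3s[1]]
        }</content>
        {
          if (exists($h3s)) then
            <sub-sections>{
              for $h3 in $h3s
              return
                <section>
                  <heading>{ string($h3/span) }</heading>
                  <content>{ local:until-next($h3, ("h2", "h3")) }</content>
                </section>
            }</sub-sections>
          else ()
        }
      </section>
  }</sections>
```

For the sample page, this should give Section A a <content> holding the two paragraphs, the div and the list, plus one sub-section for the h3, and give Section B its own <section> with a single paragraph. Deeper levels (h4 and below) would need the same pattern repeated or a recursive function.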

Thanks very much, this put me on the right track. – Stu


With XQuery 3.0 you can use window clauses to group your sections and sub-sections quite elegantly:

<sections>{ 
    for tumbling window $section in //div[@id = 'content']/* 
     start $h2 when $h2 instance of element(h2) 
    return <section>{ 
    <heading>{$h2//text()}</heading>, 
    (: content between this h2 and the first h3 of the section, if any :)
    tail($section)[not(self::h3 or preceding-sibling::h3[. >> $h2])], 
    <sub-sections>{ 
     for tumbling window $sub-section in $section 
      start $h3 when $h3 instance of element(h3) 
     return <section>{ 
     <heading>{$h3//text()}</heading>, 
     tail($sub-section) 
     }</section> 
    }</sub-sections> 
    }</section> 
}</sections> 

Unfortunately I don't have access to XQuery 3.0, or at least not all of it. I'm using MarkLogic version 7. – Stu