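XPath: Selecting certain child nodes

I am using XPath and Scrapy to scrape data from the movie website BoxOfficeMojo.com.

As a general question: I would like to know how to select certain child nodes of one parent node in a single XPath string.

Depending on the movie page I am scraping, the data I need is sometimes located in different child nodes, for example depending on whether a link is present. I will be going through about 14000 movies, so this process needs to be automated. Take this page as an example. I need the actor(s), director(s) and producer(s).

Here is the XPath for the director. Note: the %s corresponds to the index of the row where that information is found - in the Action Jackson example, the director is found at [1] and the actors at [2].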

//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()

However, if a link to the director's page exists, this would be the XPath:

//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()

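Actors are a bit trickier, because <br> elements are included for the subsequent actors listed, and those names may be children of an /a or children of the parent /font, so: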

//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()

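gets almost all of the actors (except those under font/br).

Now, I believe the main problem here is that there are multiple //div[@class="mp_box_content"] elements - everything I have works, except that I also end up getting some numbers from the other mp_box_content divs. I have also added numerous try:, except: statements in order to get everything (actors, directors and producers, both with and without associated links). For example, here is my Scrapy code for actors: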

actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()

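This is an attempt to cover two cases: the first actor may not have a link associated with him/her while subsequent actors do, or the first actor may have a link associated with him/her while the rest do not.

I appreciate the time taken to read this and any attempt to help me find and solve this problem! Please let me know if more information is needed.

Source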

2013-08-25 DMML

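I'm assuming you are only interested in the text content, not in the links to the actors' pages and so on.

Here is a proposal using lxml.html (and a bit of lxml.etree) directly.

First, I suggest you select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2], to account for both "Director" and "Directors".
Second, testing various expressions, with or without <a> and so on, makes the code difficult to read and maintain. Since you are interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
And for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content, adding a special formatting character to the <br> elements' .text or .tail, for example a \n, using one of lxml's iter() functions. This can be useful for other HTML block elements as well, such as <hr>.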
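As a minimal sketch of the first two points, here is the row lookup by label and the string() call applied to a small inline document. The HTML below is hypothetical and only imitates the shape of the real page; the link targets are made up.

```python
import lxml.html

# Hypothetical, simplified markup imitating the "The Players" table.
HTML = """
<table>
  <tr><td>Director:</td><td><font><a href="/people/d.htm">Craig R. Baxley</a></font></td></tr>
  <tr><td>Actors:</td><td><font><a href="/people/a.htm">Carl Weathers</a><br>Craig T. Nelson</font></td></tr>
</table>
"""

table = lxml.html.fromstring(HTML)

# Select the second cell of the row whose first cell starts with the label,
# so "Director:" and "Directors:" are both matched.
director_cells = table.xpath('.//tr[starts-with(td[1], "Director")]/td[2]')

# string() returns the concatenated text of the first matching cell,
# whether or not a name is wrapped in <a>.
director = table.xpath('string(.//tr[starts-with(td[1], "Director")]/td[2])')
actors = table.xpath('string(.//tr[starts-with(td[1], "Actor")]/td[2])')

print(director)
print(actors)  # the names run together, since <br> carries no text
```

The second print illustrates why the <br> handling in the third point is still needed when a cell holds several names.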

You may see better what I mean with some spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)

        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:

            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Producer")
            # the writers row is looked up by its own label
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items

The code can certainly be simplified, but I hope you get the idea.
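To try the <br> marker idea in isolation, without Scrapy or network access, a stand-alone sketch could look like the following. It is written for Python 3 (hence encoding="unicode" as a string), the helper name split_on_br is my own, and the HTML is a hypothetical stand-in for one row of the page.

```python
import lxml.html

MARKER = "|"

# Hypothetical markup: first actor linked, the others plain text after <br>.
HTML = """
<table>
  <tr><td>Actors:</td><td><font><a href="/people/a.htm">Carl Weathers</a><br>Craig T. Nelson<br>Sharon Stone</font></td></tr>
</table>
"""

def split_on_br(cell):
    """Return the names in a cell, using <br> elements as separators."""
    # Put the marker in each <br>'s .text so it appears in the text output.
    for br in cell.iter("br"):
        br.text = MARKER
    text = lxml.html.tostring(cell, method="text", encoding="unicode")
    # Split on the marker and drop empty or whitespace-only pieces.
    return [part.strip() for part in text.split(MARKER) if part.strip()]

table = lxml.html.fromstring(HTML)
cell = table.xpath('.//tr[starts-with(td[1], "Actor")]/td[2]')[0]
names = split_on_br(cell)
print(names)
```

Note that the helper modifies the tree in place, so it is best applied after any other extraction on the same cell.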

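You can of course do similar things with HtmlXPathSelector (with the string() XPath function, for example), but without modifying the tree for <br> (how would you do that with hxs?) it only works for single names in your case: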

>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract() 
[u'Craig R. Baxley'] 
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract() 
[u'Carl WeathersCraig T. NelsonSharon Stone']
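One possible way to keep the names separate without modifying the tree is to select the text nodes of the cell rather than its string value. This is only a sketch under assumptions: it uses plain lxml on hypothetical markup, and I have not checked it against the real pages, but the same XPath expression should in principle be usable from a selector as well.

```python
import lxml.html

# Hypothetical markup imitating the actors row.
HTML = """
<table>
  <tr><td>Actors:</td><td><font><a href="/people/a.htm">Carl Weathers</a><br>Craig T. Nelson<br>Sharon Stone</font></td></tr>
</table>
"""

table = lxml.html.fromstring(HTML)

# //text() returns every descendant text node of the cell in document order,
# so linked and unlinked names come back as separate list items.
raw = table.xpath('.//tr[starts-with(td[1], "Actor")]/td[2]//text()')

# Drop whitespace-only nodes and trim the rest.
actors = [t.strip() for t in raw if t.strip()]
print(actors)
```

The trade-off is that a name split across several inline elements would come back in pieces, which the marker approach avoids.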

Source

2013-08-25 22:09:27

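Wow! Thank you so much for taking the time to reply! I will implement these things soon to see what happens. I'm curious whether these methods will eliminate the problem of getting information from the other `[@class="mp_box_content"]` divs, since that was one of the main problems? – DMML

You will only get the "The Players" table content, not the other `[@class="mp_box_content"]` divs. I fixed `br2nl` to use `.text` instead of `.tail`, otherwise some lines were overwritten. I also introduced a compiled XPath expression so that you can pass the `category` argument as an XPath variable; it represents the first-cell text of the row you want –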
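To illustrate both points from this comment, here is a small sketch with two boxes sharing the mp_box_content class, where only the one labelled "The Players" is kept, followed by a compiled XPath that takes the row label as a variable. The HTML and the first box's label and figure are hypothetical, and passing the tab label as a $tab variable is my own addition (the spider above hard-codes it).

```python
import lxml.etree
import lxml.html

# Hypothetical page fragment with two boxes using the same content class.
HTML = """
<div>
  <div class="mp_box">
    <div class="mp_box_tab">Some Other Box</div>
    <div class="mp_box_content"><table><tr><td>Total:</td><td>12345</td></tr></table></div>
  </div>
  <div class="mp_box">
    <div class="mp_box_tab">The Players</div>
    <div class="mp_box_content"><table><tr><td>Director:</td><td>Craig R. Baxley</td></tr></table></div>
  </div>
</div>
"""

root = lxml.html.fromstring(HTML)

# Keep only the table inside the box whose tab label equals $tab.
BOX_TABLE = lxml.etree.XPath(
    '//div[@class="mp_box"][div[@class="mp_box_tab"]=$tab]'
    '/div[@class="mp_box_content"]/table')

# Second cell of the row whose first cell starts with $category.
CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')

all_boxes = root.xpath('//div[@class="mp_box_content"]')
tables = BOX_TABLE(root, tab="The Players")
cells = CATEGORY_CELL(tables[0], category="Director")
director = cells[0].text_content()
print(len(all_boxes), len(tables), director)
```

Compiling the expression once and passing the label as a variable avoids string formatting inside the XPath and any quoting problems that come with it.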
