2013-07-24 53 views

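Scrapy: displaying XPath text using lxml

How can I get my parse_page to output the text and values of my item titles? At the moment I can only get the href to show.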

def parse_page(self, response):
    self.log("\n\n\n Page for one device \n\n\n")
    self.log('Hi, this is the parse_page page! %s' % response.url)
    root = lxml.etree.fromstring(response.body)
    for row in root.xpath('//row'):
        allcells = row.xpath('./cell')
        #... populate Items
    for cells in allcells:
        item = CiqdisItem()
        item['title'] = cells.get(".//text()")
        item['link'] = cells.get("href")
        yield item

My XML file:

<row> 
<cell type="html"> 
<input type="checkbox" name="AF2C4452827CF0935B71FAD58652112D" value="AF2C4452827CF0935B71FAD58652112D" onclick="if(typeof(selectPkg)=='function')selectPkg(this);"> 
</cell> 
<cell type="plain" style="width: 50px; white-space: nowrap;" visible="false">http://qvpweb01.ciq.labs.att.com:8080/dis/metriclog.jsp?PKG_GID=AF2C4452827CF0935B71FAD58652112D&amp;view=list</cell> 
<cell type="plain">6505550000</cell> 
<cell type="plain">probe0</cell> 
<cell type="href" style="width: 50px; white-space: nowrap;" href="metriclog.jsp?PKG_GID=AF2C4452827CF0935B71FAD58652112D&view=list"> 
UPTR 
<input id="savePage_AF2C4452827CF0935B71FAD58652112D" type="hidden" value="AF2C4452827CF0935B71FAD58652112D"> 
</cell> 
<cell type="href" href="/dis/packages.jsp?show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&triggerfilter=&maxlength=100&view=timeline&date=20100716T050314876" style="white-space: nowrap;">2010-07-16 05:03:14.876</cell> 
<cell type="plain" style="width: 50px; white-space: nowrap;"></cell> 
<cell type="plain" style="white-space: nowrap;"></cell> 
<cell type="plain" style="white-space: nowrap;">2012-10-22 22:40:15.504</cell> 
<cell type="plain" style="width: 70px; white-space: nowrap;">1 - SMS_PullRequest_CS</cell> 
<cell type="href" style="width: 50px; white-space: nowrap;" href="/dis/profile_download?profileId=4294967295">4294967295</cell> 
<cell type="plain" style="width: 50px; white-space: nowrap;">250</cell> 
</row> 

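Below is my latest edit, showing both methods. The problem with the first method is that it does not parse all the links in column A in order, and it behaves inconsistently: if column A is empty, it grabs the next link from column B. How can I extract only column A, skip a row when its column A is null, and keep going down the same column?

Method 2 is parse_page. It does not iterate over all rows, so the parse is incomplete. How do I get all the rows?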

def parse_device_list(self, response):
    self.log("\n\n\n List of devices \n\n\n")
    self.log('Hi, this is the parse_device_list page! %s' % response.url)
    root = lxml.etree.fromstring(response.body)
    for row in root.xpath('//row'):
        allcells = row.xpath('.//cell')
        # first cell contain the link to follow
        detail_page_link = allcells[0].get("href")
        yield Request(urlparse.urljoin(response.url, detail_page_link), callback=self.parse_page)

def parse_page(self, response):
    self.log("\n\n\n Page for one device \n\n\n")
    self.log('Hi, this is the parse_page page! %s' % response.url)
    xxs = XmlXPathSelector(response)
    for row in xxs.select('//row'):
        for cell in row.select('.//cell'):
            item = CiqdisItem()
            item['title'] = cell.select("text()").extract()
            item['link'] = cell.select("@href").extract()
            yield item

Answer


Just replace .//text() with text() and href with @href.

Also, why lxml? Scrapy has XPath selectors built in, so give them a try:

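from scrapy.selector import HtmlXPathSelector

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    for row in hxs.select('//row'):
        for cell in row.select('.//cell'):
            item = CiqdisItem()
            # Selectors are queried with select(...).extract(), which returns a list of strings
            item['title'] = cell.select("text()").extract()
            item['link'] = cell.select("@href").extract()
            yield item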
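
If you prefer to stay with an ElementTree-style API, the reason the title comes back empty is that get() looks up an attribute by name and does not evaluate XPath. The sketch below illustrates this with the standard library's xml.etree.ElementTree, swapped in here for lxml.etree so that it runs without extra packages; lxml.etree provides the same get() and itertext() calls. The SAMPLE data and the cell_to_item helper are assumptions made for illustration, not part of the original spider.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the XML in the question (hypothetical data).
SAMPLE = """
<rows>
  <row>
    <cell type="plain">6505550000</cell>
    <cell type="href" href="/dis/profile_download?profileId=4294967295">4294967295</cell>
  </row>
</rows>
"""


def cell_to_item(cell):
    """Return the text and href of one <cell> as a plain dict (hypothetical helper)."""
    # itertext() walks every text node under the element, so nested text is kept too.
    text = "".join(cell.itertext()).strip()
    # get() reads an attribute by name; it returns None when the attribute is absent.
    return {"title": text, "link": cell.get("href")}


root = ET.fromstring(SAMPLE)
items = [cell_to_item(cell) for cell in root.iter("cell")]
print(items)

# get() with an XPath expression just searches for an attribute of that name.
print(root.find(".//cell").get(".//text()"))
```

In the spider itself, the dict would be replaced by a CiqdisItem and each item yielded from parse_page.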

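Thanks @alecxe - I only needed to change from hxs to xxs = XmlXPathSelector(response). I have also posted another question, [link to my second question](http://stackoverflow.com/questions/17861781/convert-lxml-to-scrapy-xxs-selector), about converting lxml to Scrapy's built-in xxs. For this one I first tried to do it with xxs but failed, until someone suggested I try lxml to get it working, and it did work in lxml. – Gio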


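Hi @alecxe - after analyzing my web crawler, I noticed that this does not parse all the rows of each table; it parses a few rows but not all of them. All the rows are on the same page (there is no per-page limit on the number of rows). It might help if I edit my question and paste both methods. – Gio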
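
For the column A part of the question, one possible approach is to index the cells of each row by position and skip the row when that cell has no href, rather than searching for the first href anywhere in the row. The following is a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree in place of lxml, Python 3's urllib.parse.urljoin in place of the urlparse module used in the question, and the names LINK_COLUMN, links_in_column, SAMPLE and BASE are hypothetical. It does not address why some rows are missing from the parse.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample: the second row has no link in the chosen column.
SAMPLE = """
<rows>
  <row>
    <cell type="plain">probe0</cell>
    <cell type="href" href="metriclog.jsp?PKG_GID=A1">UPTR</cell>
    <cell type="href" href="/dis/packages.jsp?show=perdevice">2010-07-16</cell>
  </row>
  <row>
    <cell type="plain">probe1</cell>
    <cell type="plain"></cell>
    <cell type="href" href="/dis/packages.jsp?show=perdevice">2010-07-17</cell>
  </row>
  <row>
    <cell type="plain">probe2</cell>
    <cell type="href" href="metriclog.jsp?PKG_GID=A3">UPTR</cell>
    <cell type="plain"></cell>
  </row>
</rows>
"""

LINK_COLUMN = 1  # assumption: zero-based position of "column A" within each row


def links_in_column(xml_text, base_url, column=LINK_COLUMN):
    """Collect absolute links from one fixed column, skipping rows without one."""
    root = ET.fromstring(xml_text)
    links = []
    for row in root.iter("row"):
        cells = row.findall("cell")  # direct children only, in document order
        if column >= len(cells):
            continue  # row is too short to have this column
        href = cells[column].get("href")
        if not href:
            continue  # empty column: skip the row instead of using the next column
        links.append(urljoin(base_url, href))
    return links


BASE = "http://example.com/dis/devices.jsp"  # placeholder for response.url
links = links_in_column(SAMPLE, BASE)
print(links)
```

Inside a spider, each collected link would be passed to Request(..., callback=self.parse_page) instead of being appended to a list.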