
I am trying to scrape the contact details for agent_name by following each listing through to its detail page. Sometimes this script returns me one entry, sometimes different entries, and I cannot figure out why. What is wrong with my scraper?

import scrapy 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.selector import Selector 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from urlparse import urljoin 


class CompItem(scrapy.Item): 
    title = scrapy.Field() 
    link = scrapy.Field() 
    data = scrapy.Field() 


class criticspider(CrawlSpider): 
    name = "comp" 
    allowed_domains = ["iproperty.com.my"] 
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"] 


    def parse(self, response):
        sites = response.xpath('.//*[@id="frmSaveListing"]/ul')
        items = []

        for site in sites:
            item = CompItem()
            item['title'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/text()').extract()[0]
            item['link'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//*[@id="main-content3"]/div[1]/div/table/tbody/tr/td[1]/table/tbody/tr[3]/td/text()').extract()
        yield old_item

Have you looked at how the web page changes when your code runs and doesn't work? – 2015-04-04 14:35:26


I checked the web page; it changes as new listings come in, but shouldn't it still pull out the data that matches the XPath? – nik 2015-04-04 14:36:40

Answer


Even if you open the start URL in a browser and refresh the page several times, you get different search results.

In any case, your spider needs tweaking, since it does not extract all of the agents on the page:

import scrapy 
from urlparse import urljoin 


class CompItem(scrapy.Item): 
    title = scrapy.Field() 
    link = scrapy.Field() 
    data = scrapy.Field() 


class criticspider(scrapy.Spider): 
    name = "comp" 

    allowed_domains = ["iproperty.com.my"] 
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"] 


    def parse(self, response):
        agents = response.xpath('//li[@class="search-listing"]//div[@class="article-right"]')
        for agent in agents:
            item = CompItem()
            item['title'] = agent.xpath('.//a/text()').extract()[0]
            item['link'] = agent.xpath('.//a/@href').extract()[0]
            yield scrapy.Request(urljoin("http://www.iproperty.com.my", item['link']),
                                 meta={'item': item},
                                 callback=self.anchor_page)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//*[@id="main-content3"]//table//table//p/text()').extract()
        yield old_item

What I have fixed:

  • used scrapy.Spider instead of CrawlSpider
  • fixed the XPath expressions so that the spider iterates over every agent on the page, then follows each link and grabs the agent's self-description/promotion (a more defensive variant is sketched below)
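
A side note, not part of the original answer: extract()[0] raises an IndexError as soon as an XPath matches nothing, which is one way a spider can appear to work only some of the time. A minimal defensive sketch of parse(), assuming the same selectors as in the answer above:

    def parse(self, response):
        # Same agent selector as above; each node is one search-result block.
        for agent in response.xpath('//li[@class="search-listing"]//div[@class="article-right"]'):
            title = agent.xpath('.//a/text()').extract()  # a list, possibly empty
            link = agent.xpath('.//a/@href').extract()
            if not title or not link:
                continue  # skip listings that have no visible agent link
            item = CompItem()
            item['title'] = title[0]
            item['link'] = link[0]
            yield scrapy.Request(urljoin("http://www.iproperty.com.my", item['link']),
                                 meta={'item': item},
                                 callback=self.anchor_page)

Checking the lists before indexing keeps a single malformed listing from killing the whole parse() callback.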

Thanks for the help, friend – nik 2015-04-05 15:39:04


Hey, could you suggest how I would write a rule to parse all the pages? – nik 2015-04-05 15:39:48


@nik could you please elaborate on that in a separate question? Thanks – alecxe 2015-04-05 15:46:02
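
On the pagination question above: a common Scrapy pattern is to have parse() yield a request for the "next page" link back into itself. A rough sketch only; the XPath for the next-page link is an assumption about the site's markup and needs to be verified against the live page:

    def parse(self, response):
        # ... extract the agents exactly as in the answer above ...

        # Assumed selector for the pager's "next" link; check the real markup.
        next_page = response.xpath('//a[contains(text(), "Next")]/@href').extract()
        if next_page:
            yield scrapy.Request(urljoin(response.url, next_page[0]),
                                 callback=self.parse)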