
Problem trying to scrape pagination links with Scrapy

I'm trying to learn Scrapy by scraping entry titles from a paginated property site. I can't get any entries from the "Next" pages matched by the rule defined in the rules list.

Code:

from scrapy import Spider
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import re

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['http://chennai.vivastreet.co.in/']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        a = Selector(response).xpath('//a[contains(@id, "vs-detail-link")]/text()').extract()
        i = 1
        for b in a:
            print('testtttttttttttttt ' + str(i) + '\n' + str(b))
            i += 1
        item = PropertyItem()
        item['title'] = a[0]
        yield item

EDIT - Replaced the parse method with parse_item, and now nothing gets scraped at all.

Ignore the item-object code at the end; I intend to replace it with a Request callback to another method that scrapes more details from each entry's URL.

I'll post the logs if needed.

EDIT #2 - I now grab the entry URLs from the paginated pages, then yield a Request to another method, which finally scrapes the details from each entry's page. The parse_start_url() method is working, but the parse_item() method never gets called.

Code:

from scrapy import Request
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import sys

reload(sys)
sys.setdefaultencoding('utf8')  # To prevent UnicodeDecodeError, UnicodeEncodeError.

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['chennai.vivastreet.co.in']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
             callback='parse_start_url', follow=True)
    ]

    def parse_start_url(self, response):
        urls = Selector(response).xpath('//a[contains(@id, "vs-detail-link")][@href]').extract()
        print('test0000000000000000000' + str(urls[0]))
        for url in urls:
            yield Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        #item = PropertyItem()
        a = Selector(response).xpath('//*h1[@class = "kiwii-font-xlarge kiwii-margin-none"').extract()
        print('test tttttttttttttttttt ' + str(a))

Answer


There are a couple of things wrong with your spider.

  1. allowed_domains is broken - it should contain bare domains, not URLs with a scheme. If you check your log, you probably got a lot of filtered offsite requests.

  2. You've misunderstood CrawlSpider a bit here. When a CrawlSpider starts, it downloads every URL in start_urls and calls parse_start_url with each response; the rules are then applied to links extracted from the downloaded pages, so your Rule's callback fires on the "Next" pages, not on the start page.

So your spider should look something like this:

class VivastreetSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['chennai.vivastreet.co.in']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(
            LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
            callback='parse_start_url'
        )
    ]

    def parse_start_url(self, response):
        a = Selector(response).xpath('//a[contains(@id, "vs-detail-link")]/text()').extract()
        return {'test': len(a)}
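
One thing to watch: when a Rule has a callback, its follow flag defaults to False, so the spider above stops after the first "Next" page unless you pass follow=True. Also, rather than yielding Requests by hand, you can let CrawlSpider chase the entry links for you with a second Rule. A minimal sketch, assuming the detail anchors keep their "vs-detail-link" id and using a hypothetical parse_detail callback:

rules = [
    # Keep following the pagination, scraping each listing page on the way.
    Rule(
        LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
        callback='parse_start_url',
        follow=True
    ),
    # Let CrawlSpider request every entry's detail page for you.
    Rule(
        LinkExtractor(restrict_xpaths='//a[contains(@id, "vs-detail-link")]'),
        callback='parse_detail'
    ),
]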

Thanks for your help. I got this part working, but I'm calling another method to scrape details from the URLs extracted on each page, and that doesn't seem to work. What am I doing wrong? I've edited the question. –
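
For the follow-up above: the EDIT #2 code extracts whole <a> elements instead of href values, so the yielded Requests never contain real URLs, and the detail-page XPath is invalid (a stray * before h1 and an unclosed predicate). A minimal corrected sketch of the two methods, untested against the live site and assuming the kiwii-* classes from the question are still present:

def parse_start_url(self, response):
    # Pull out the href attribute values, not the whole <a> elements.
    for href in response.xpath('//a[contains(@id, "vs-detail-link")]/@href').extract():
        # urljoin resolves relative links; absolute ones pass through unchanged.
        yield Request(response.urljoin(href), callback=self.parse_item)

def parse_item(self, response):
    # Valid XPath this time: no "*" before the tag name, and the predicate is closed.
    title = response.xpath('//h1[contains(@class, "kiwii-font-xlarge")]/text()').extract_first()
    yield {'title': title}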