Problem trying to scrape pagination links with Scrapy

I'm trying to learn Scrapy by scraping entry titles from a paginated property website. I'm unable to scrape entries from the "Next" pages matched by the rule defined in the rules list.

Code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import re

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['http://chennai.vivastreet.co.in/']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        a = Selector(response).xpath('//a[contains(@id, "vs-detail-link")]/text()').extract()
        i = 1
        for b in a:
            print('testtttttttttttttt ' + str(i) + '\n' + str(b))
            i += 1
        item = PropertyItem()
        item['title'] = a[0]
        yield item
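One thing worth flagging in the snippet above (an inference from reading the code, not from logs): allowed_domains contains a full URL. Scrapy's offsite filtering matches bare domain names, so with a scheme and trailing slash in there every followed link can be dropped as off-site and the "Next" rule never fires. A minimal sketch of the corrected attribute:

allowed_domains = ['chennai.vivastreet.co.in']  # bare domain, no scheme or trailing slash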
Edit: I replaced the parse method with parse_item, and now it doesn't scrape anything.
Ignore the item object code at the end; I plan to replace it with a Request callback to another method that scrapes more details from each entry's URL.
I'll post the logs if needed.
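For reference, the callback chain described above usually looks something like the following sketch (parse_detail is an illustrative name and the title selector is a placeholder; neither comes from the original code):

def parse_item(self, response):
    # Follow each listing link and hand the detail page to a second callback.
    for href in response.xpath('//a[contains(@id, "vs-detail-link")]/@href').extract():
        yield Request(response.urljoin(href), callback=self.parse_detail)

def parse_detail(self, response):
    # Build the item from the detail page rather than the listing page.
    item = PropertyItem()
    item['title'] = response.xpath('//h1/text()').extract_first()  # placeholder selector
    yield item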
Edit #2: I now grab the entry URLs from the pagination pages, yield a Request to another method, and finally scrape the details from each entry's page. The parse_start_url() method is working, but the parse_item() method is never called.

Code:
from scrapy import Request
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem

import sys
reload(sys)
sys.setdefaultencoding('utf8')  # To prevent UnicodeDecodeError/UnicodeEncodeError.

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['chennai.vivastreet.co.in']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
             callback='parse_start_url', follow=True)
    ]

    def parse_start_url(self, response):
        urls = Selector(response).xpath('//a[contains(@id, "vs-detail-link")][@href]').extract()
        print('test0000000000000000000' + str(urls[0]))
        for url in urls:
            yield Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        #item = PropertyItem()
        a = Selector(response).xpath('//h1[@class="kiwii-font-xlarge kiwii-margin-none"]').extract()
        print('test tttttttttttttttttt ' + str(a))
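A likely reason parse_item() never runs (an inference from the XPath above, not confirmed by logs): calling extract() on '//a[...][@href]' returns the serialized <a> elements rather than URL strings, so each Request is built from an HTML fragment and rejected before the callback can fire. A sketch of parse_start_url() that pulls the href values instead:

def parse_start_url(self, response):
    # Select the href attribute itself so extract() yields URL strings.
    urls = response.xpath('//a[contains(@id, "vs-detail-link")]/@href').extract()
    for url in urls:
        # urljoin guards against relative hrefs.
        yield Request(response.urljoin(url), callback=self.parse_item)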
Thanks for the help. I got that part working, but calling another method to scrape details from the URLs extracted from that page doesn't seem to work. What am I doing wrong? I've edited the question above.