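Last page not showing in Scrapy

My code (pasted below) does almost what I want. However, it covers only 29 of the 30 pages and skips the last one. Ideally it would also continue past page 30, but the site has no button for that (the page does load when you manually set page=31 in the link). With DEPTH_LIMIT at 29 everything works fine, but at 30 I get the following error in the command prompt: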
File "C:\Users\Ewald\Scrapy\OB\OB\spiders\spider_OB.py", line 23, in parse
next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[@class="volgende"]/@href').extract()[0]
IndexError: list index out of range
I have tried various approaches, but the solution keeps eluding me...
class OB_Crawler(CrawlSpider):
    name = 'OB5'
    allowed_domains = ["https://www.officielebekendmakingen.nl/"]
    start_urls = ["https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=DatumPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"]
    custom_settings = {
        'BOT_NAME': 'OB-crawler',
        'DEPTH_LIMIT': 30,
        'DOWNLOAD_DELAY': 0.1
    }

    def parse(self, response):
        s = Selector(response)
        next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[@class="volgende"]/@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = response.selector.xpath('//div[@class = "lijst"]/ul/li')
        for post in posts:
            i = TextPostItem()
            i['title'] = ' '.join(post.xpath('a/@href').extract()).replace(';', '').replace(' ', '').replace('\r\n', '')
            i['link'] = ' '.join(post.xpath('a/text()').extract()).replace(';', '').replace(' ', '').replace('\r\n', '')
            i['info'] = ' '.join(post.xpath('a/em/text()').extract()).replace(';', '').replace(' ', '').replace('\r\n', '').replace(',', '-')
            yield i
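
A likely cause, judging only from the traceback: the last page has no `a.volgende` link, so `extract()` returns an empty list and `[0]` raises the IndexError. Because that happens before the `posts` loop, the items on that page are never yielded, which would explain the missing last page. The sketch below shows one way to guard against this using only the standard library. Both helper names (`build_next_link`, `bump_page`) are hypothetical, and treating `_page` as the page counter is an assumption based on the start URL, not something confirmed by the site.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Base URL taken from the spider in the question.
BASE_URL = "https://zoek.officielebekendmakingen.nl/"


def build_next_link(hrefs, base_url=BASE_URL):
    """Hypothetical helper: absolute URL of the next page, or None.

    `hrefs` is the list returned by
    response.xpath('//a[@class="volgende"]/@href').extract().
    """
    if not hrefs:  # no "volgende" link on this page, so nothing to index
        return None
    return urljoin(base_url, hrefs[0])


def bump_page(url, param="_page"):
    """Hypothetical helper: return `url` with its page counter raised by one.

    Assumes `param` holds an integer page number; that "_page" is the right
    parameter for this site is a guess based on the start URL.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    bumped = [(k, str(int(v) + 1)) if k == param else (k, v) for k, v in query]
    # urlencode percent-encodes characters such as "|", which is equivalent.
    return urlunsplit(parts._replace(query=urlencode(bumped)))
```

In `parse`, the idea would be to compute `next_link = build_next_link(hrefs)` and yield a request only when it is not `None`. Note that the existing `if len(next_link):` check can never be false, because the string always starts with the base URL. To continue past the last button, one option is to fall back to `bump_page(response.url)` and stop once `posts` comes back empty. Two other things that look worth checking: `allowed_domains` expects bare domain names (for example `"officielebekendmakingen.nl"`) rather than full URLs, and `title` is filled from `@href` while `link` is filled from `text()`, which looks swapped.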