I'm trying to write my first web crawler/data extractor with Scrapy, and I can't get it to follow links. I also get an error:

ERROR: Spider error processing <GET https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles>

I know the spider scans the page once, because I was able to pull information out of the a tags and h1 elements I was experimenting with. Does anyone know how I can make this follow the links on the page and get rid of the error?
import scrapy
from scrapy.linkextractors import LinkExtractor
from wikiCrawler.items import WikicrawlerItem
from scrapy.spiders import Rule


class WikispyderSpider(scrapy.Spider):
    name = "wikiSpyder"
    allowed_domains = ['https://en.wikipedia.org/']
    start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles']

    rules = (
        Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse"),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        items = []
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        for link in links:
            item = WikicrawlerItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            items.append(item)
        print(items)
        return items
You're amazing, thank you for the help! – Asuu