I've been stuck on this for days and it's driving me crazy. Scrapy: rules set inside __init__ are ignored by CrawlSpider
I call my Scrapy spider like this:
scrapy crawl example -a follow_links="True"
I pass in the follow_links flag to determine whether the entire website should be scraped, or just the index page I define in the spider.
This flag is checked in the spider's constructor to decide which rules should be set:
    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )
If it's "True", all links are allowed; if it's "False", all links are denied.
So far, so good; however, these rules are ignored. The only way I can get rules to be followed is if I define them outside of the constructor. That means something like this works as expected:
class ExampleSpider(CrawlSpider):

    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...
So basically, defining the rules inside the __init__ constructor causes them to be ignored, whereas defining the rules outside the constructor works as expected.
I can't understand this. My code is below.
import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)

        # single page or follow links
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None

    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None
Thank you for taking the time to help me with this.
You could try setting your rules before calling `super(ExampleSpider, ...` – eLRuLL
@eLRuLL, you should post this as an answer –
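The suggestion above matches how CrawlSpider behaves: its __init__ compiles whatever self.rules holds at that moment, so rules assigned after super().__init__() returns are never seen. A minimal pure-Python sketch of the same ordering issue (BaseSpider, BrokenSpider, and FixedSpider are hypothetical stand-ins for illustration, not Scrapy classes):

```python
class BaseSpider:
    """Stand-in for CrawlSpider: it snapshots self.rules during __init__."""
    rules = ()

    def __init__(self):
        # Analogous to CrawlSpider compiling its rules at construction time:
        # whatever self.rules holds right now is what actually gets used.
        self.compiled_rules = list(self.rules)


class BrokenSpider(BaseSpider):
    """Assigns rules AFTER super().__init__() -- too late, they are ignored."""
    def __init__(self):
        super().__init__()
        self.rules = ("follow-everything",)


class FixedSpider(BaseSpider):
    """Assigns rules BEFORE super().__init__() -- they get compiled."""
    def __init__(self):
        self.rules = ("follow-everything",)
        super().__init__()


print(BrokenSpider().compiled_rules)  # [] -- the late assignment is invisible
print(FixedSpider().compiled_rules)   # ['follow-everything']
```

Applied to the spider above, that means moving the `if self.follow_links == "True": ...` block so it runs before the `super(ExampleSpider, self).__init__(*args, **kwargs)` call.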