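time.sleep() function not working within Scrapy recursive webscraper

I am using Python.org version 2.7 64-bit on Windows Vista 64-bit. I have some recursive webscraping code that is being caught by the anti-scraping measures of the site I am looking at: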
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.item import Item
    from scrapy.spider import BaseSpider
    from scrapy import log
    from scrapy.cmdline import execute
    from scrapy.utils.markup import remove_tags
    import time


    class ExampleSpider(CrawlSpider):
        name = "goal3"
        allowed_domains = ["whoscored.com"]
        start_urls = ["http://www.whoscored.com/"]
        rules = [Rule(SgmlLinkExtractor(allow=()),
                      follow=True),
                 Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
                 ]

        def parse_item(self, response):
            self.log('A response from %s just arrived!' % response.url)
            scripts = response.selector.xpath("normalize-space(//title)")
            for scripts in scripts:
                body = response.xpath('//p').extract()
                body2 = "".join(body)
                print remove_tags(body2).encode('utf-8')
                time.sleep(5)

    execute(['scrapy', 'crawl', 'goal3'])
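To stop this from happening, I have tried adding a basic time.sleep() call to slow down the rate at which requests are submitted. However, when I run the code from the Command Prompt, the call does not seem to have any effect. The code continues to run at the same speed, so every request comes back as HTTP 403.

Can anyone see why this might not be working?

Thanks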
If you want to make this behaviour dynamic, you should look at http://doc.scrapy.org/en/latest/topics/autothrottle.html.
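
For illustration, a minimal sketch of how throttling might be switched on in the project's settings.py. Scrapy schedules and sends requests itself, so delays are normally configured through settings rather than inside a callback. The setting names below are standard Scrapy settings; the values are assumptions picked for the example and are not tuned for whoscored.com.

```python
# settings.py (sketch; all values are illustrative assumptions)

# Fixed minimum wait, in seconds, between requests to the same site.
DOWNLOAD_DELAY = 5

# Let the AutoThrottle extension adjust the delay based on observed latency.
AUTOTHROTTLE_ENABLED = True

# Initial delay used before any latency has been measured.
AUTOTHROTTLE_START_DELAY = 5

# Upper bound on the delay AutoThrottle may apply when the site is slow.
AUTOTHROTTLE_MAX_DELAY = 60
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound on the delay, so the crawl should not go faster than one request every 5 seconds per site under these assumed values. Whether that is slow enough to avoid the 403 responses depends on the site.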