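Scrapy: regular expression for the URLs to crawl
I wrote a spider in Scrapy that basically works well and does exactly what it should. The problem shows up in the log when I run scrapy crawl. Here is the spider: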
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ecommerce.items import ArticleItem

class WikiSpider(CrawlSpider):
    name = 'wiki'
    start_urls = (
        'http://www.wiki.tn/index.php',
    )
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'\w+\/\d{1,4}\/\d{1,4}\/\d{1,4}\X+']),
             follow=True, callback='parse_Article_wiki'),
    ]

    def parse_Article_wiki(self, response):
        hxs = HtmlXPathSelector(response)
        item = ArticleItem()
        print '*******************>> '+response.url

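But it does not work as expected. When I run the spider, only the start URL is crawled and the callback is never called. This is the output: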
2014-07-09 15:03:13+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware,
OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-09 15:03:13+0100 [scrapy] INFO: Enabled item pipelines:
2014-07-09 15:03:13+0100 [wiki] INFO: Spider opened
2014-07-09 15:03:13+0100 [wiki] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-09 15:03:13+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-07-09 15:03:13+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-07-09 15:03:13+0100 [wiki] DEBUG: Crawled (200) <GET http://www.wiki.tn/index.php> (referer: None)
2014-07-09 15:03:13+0100 [wiki] INFO: Closing spider (finished)
2014-07-09 15:03:13+0100 [wiki] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 13062,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 7, 9, 14, 3, 13, 416073),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 7, 9, 14, 3, 13, 210430)}
2014-07-09 15:03:13+0100 [wiki] INFO: Spider closed (finished)
What exactly is the '\X+' in your 'allow' pattern meant to do? I can't see any support for it in https://docs.python.org/2/library/re.html. Without it, you should be fine.
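
To illustrate the comment, here is a minimal sketch that checks the 'allow' pattern with Python's re module alone, without running Scrapy. It drops the trailing '\X+' and writes '\/' as a plain '/', which matches the same text. The two URLs are hypothetical, made up for illustration; the real link layout on the site may differ. The helper name is_allowed is also an assumption, and it uses re.search on the guess that link extractors test 'allow' patterns against the whole URL that way. As far as I can tell, Python 2 treated an unknown escape such as '\X' as a literal 'X', so the original pattern would only match URLs with an 'X' right after the last number, while current Python 3 releases reject the escape outright.

```python
import re

# The pattern from the spider with the unsupported `\X+` suffix removed.
ALLOW = r'\w+/\d{1,4}/\d{1,4}/\d{1,4}'

# Hypothetical URLs for illustration only; the real site may use another layout.
article_url = 'http://www.wiki.tn/article/2014/07/09/some-title.html'
index_url = 'http://www.wiki.tn/index.php'


def is_allowed(url, pattern=ALLOW):
    """Return True if the pattern matches anywhere in the URL."""
    return re.search(pattern, url) is not None


if __name__ == '__main__':
    for url in (article_url, index_url):
        print(url, is_allowed(url))
```

If the pattern without '\X+' still matches nothing on the real pages, the next thing to check would be whether the links in the page source actually have the word/number/number/number shape the pattern assumes.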