I am running into a strange problem while trying to crawl a particular website. If I crawl a few pages with BaseSpider, the code works perfectly, but if I change the code to use CrawlSpider, the spider runs without any errors, yet none of my callbacks ever fire and the crawl finishes immediately.
The following snippet works fine:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.loader import XPathItemLoader
    from dirbot.items import Website
    from urlparse import urlparse
    from scrapy import log

    class hushBabiesSpider(BaseSpider):
        name = "hushbabies"
        #download_delay = 10
        allowed_domains = ["hushbabies.com"]
        start_urls = [
            "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
            "http://www.hushbabies.com/category/mommy-newborn.html",
            "http://www.hushbabies.com"
        ]

        def parse(self, response):
            print response.body
            print "Inside parse Item"
            return []
The following snippet does not work:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.loader import XPathItemLoader
    from dirbot.items import Website
    from urlparse import urlparse
    from scrapy import log

    class hushBabiesSpider(CrawlSpider):
        name = "hushbabies"
        #download_delay = 10
        allowed_domains = ["hushbabies.com"]
        start_urls = [
            "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
            "http://www.hushbabies.com/category/mommy-newborn.html",
            "http://www.hushbabies.com"
        ]
        rules = (
            Rule(SgmlLinkExtractor(allow=()),
                 'parseItem',
                 follow=True,
            ),
        )

        def parseItem(self, response):
            print response.body
            print "Inside parse Item"
            return []
The output from running Scrapy is as follows:
scrapy crawl hushbabies
2012-07-23 18:50:37+0000 [scrapy] INFO: Scrapy 0.15.1-198-g831a450 started (bot: SKBot)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, WebService, CoreStats, MemoryUsage, SpiderState, CloseSpider
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled item pipelines: SQLStorePipeline
2012-07-23 18:50:37+0000 [hushbabies] INFO: Spider opened
2012-07-23 18:50:37+0000 [hushbabies] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-23 18:50:37+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/robots.txt> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/category/mommy-newborn.html> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Closing spider (finished)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Dumping spider stats:
{'downloader/request_bytes': 634,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 44395,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 23, 18, 50, 39, 674965),
'scheduler/memory_enqueued': 2,
'start_time': datetime.datetime(2012, 7, 23, 18, 50, 37, 700711)}
2012-07-23 18:50:39+0000 [hushbabies] INFO: Spider closed (finished)
2012-07-23 18:50:39+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 27820032, 'memusage/startup': 27652096}
Changing the website from hushbabies.com to some other site makes the code work as expected.
Sounds like a strange bug! Why doesn't SgmlLinkExtractor work on this particular website? Is there a specific reason? – 2012-07-26 19:30:24
@Zulubaba: Good question. – 2012-07-27 14:42:40
The Scrapy documentation has this to say: "SGMLParser based link extractors are unmaintained and their use is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor." The documentation page is: [http://doc.scrapy.org/en/latest/topics/link-extractors.html](http://doc.scrapy.org/en/latest/topics/link-extractors.html) – gm2008 2015-04-17 17:56:23
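A plausible explanation for the symptom above is that the SGMLParser underlying SgmlLinkExtractor trips over this site's malformed HTML and extracts no links at all, leaving the CrawlSpider nothing to follow, while an lxml-based extractor tolerates the broken markup. As a rough illustration of the principle only (this uses the stdlib `html.parser` as a stand-in, not Scrapy's actual LxmlLinkExtractor, and the HTML sample is invented), a lenient parser can still recover hrefs from broken markup:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, tolerating broken markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented sample: unclosed <a> tags and a stray "<" in text content,
# the kind of markup that can derail a strict SGML-style parser.
broken_html = ('<div><a href="/category/toys.html">Toys'
               '<a href="/category/mommy.html">Mommy < Newborn</div>')

collector = LinkCollector()
collector.feed(broken_html)
print(collector.links)
```

Both hrefs are recovered even though neither `<a>` tag is closed, which is the same tolerance LxmlLinkExtractor gets from lxml's recovering HTML parser.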