Scrapy CrawlSpider does not crawl (2012-07-23)

I am running into a strange problem while trying to crawl a particular website. If I crawl some of its pages with a BaseSpider, the code works perfectly, but if I change the code to use a CrawlSpider, the spider finishes without any errors and without crawling anything.

The following code works fine:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log


class hushBabiesSpider(BaseSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com",
    ]

    def parse(self, response):
        print response.body
        print "Inside parse Item"
        return []

The following snippet does not work:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.loader import XPathItemLoader 
from dirbot.items import Website 
from urlparse import urlparse 
from scrapy import log 

class hushBabiesSpider(CrawlSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=()),
             callback='parseItem',
             follow=True),
    )

    def parseItem(self, response):
        print response.body
        print "Inside parse Item"
        return []

The output from running Scrapy is as follows:

scrapy crawl hushbabies 
2012-07-23 18:50:37+0000 [scrapy] INFO: Scrapy 0.15.1-198-g831a450 started (bot: SKBot) 
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, WebService, CoreStats, MemoryUsage, SpiderState, CloseSpider 
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled item pipelines: SQLStorePipeline 
2012-07-23 18:50:37+0000 [hushbabies] INFO: Spider opened 
2012-07-23 18:50:37+0000 [hushbabies] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2012-07-23 18:50:37+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/robots.txt> (referer: None) 
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com> (referer: None) 
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/category/mommy-newborn.html> (referer: None) 
2012-07-23 18:50:39+0000 [hushbabies] INFO: Closing spider (finished) 
2012-07-23 18:50:39+0000 [hushbabies] INFO: Dumping spider stats: 
     {'downloader/request_bytes': 634, 
     'downloader/request_count': 3, 
     'downloader/request_method_count/GET': 3, 
     'downloader/response_bytes': 44395, 
     'downloader/response_count': 3, 
     'downloader/response_status_count/200': 3, 
     'finish_reason': 'finished', 
     'finish_time': datetime.datetime(2012, 7, 23, 18, 50, 39, 674965), 
     'scheduler/memory_enqueued': 2, 
     'start_time': datetime.datetime(2012, 7, 23, 18, 50, 37, 700711)} 
2012-07-23 18:50:39+0000 [hushbabies] INFO: Spider closed (finished) 
2012-07-23 18:50:39+0000 [scrapy] INFO: Dumping global stats: 
     {'memusage/max': 27820032, 'memusage/startup': 27652096} 

Changing the website from hushbabies.com to another site makes the code work as expected.

Answer


There seems to be a problem with sgmllib, the SGML parser that SgmlLinkExtractor is built on.

The following code, run in the scrapy shell, returns zero links:

>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
>>> fetch('http://www.hushbabies.com/') 
>>> len(SgmlLinkExtractor().extract_links(response)) 
0 
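
To confirm that the parser itself, not Scrapy's extraction logic, is at fault, here is a hypothetical standalone check using Python 2's sgmllib that counts how many <a> tags the parser reports for the page; if sgmllib chokes on this site's markup, the count will be far lower than the number of anchors actually present:

import urllib2
from sgmllib import SGMLParser

class AnchorCounter(SGMLParser):
    # Count every <a> tag the parser manages to recognize.
    def reset(self):
        SGMLParser.reset(self)
        self.count = 0

    def start_a(self, attrs):
        self.count += 1

parser = AnchorCounter()
parser.feed(urllib2.urlopen('http://www.hushbabies.com/').read())
parser.close()
print parser.count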

You can try the alternative link extractor from slybot, which depends on scrapely:

>>> from slybot.linkextractor import LinkExtractor 
>>> from scrapely.htmlpage import HtmlPage 
>>> p = HtmlPage(body=response.body_as_unicode()) 
>>> sum(1 for _ in LinkExtractor().links_to_follow(p)) 
314 
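
If the slybot extractor finds the links, one possible workaround is to drive the requests yourself from a BaseSpider instead of relying on CrawlSpider rules. A minimal, untested sketch; it assumes the objects yielded by links_to_follow() expose the absolute URL as a url attribute:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapely.htmlpage import HtmlPage
from slybot.linkextractor import LinkExtractor

class hushBabiesSlybotSpider(BaseSpider):
    name = "hushbabies_slybot"
    allowed_domains = ["hushbabies.com"]
    start_urls = ["http://www.hushbabies.com"]

    def parse(self, response):
        # Wrap the response in a scrapely HtmlPage so slybot can parse it.
        page = HtmlPage(url=response.url, body=response.body_as_unicode())
        for link in LinkExtractor().links_to_follow(page):
            # Assumption: each link carries its URL in a .url attribute.
            yield Request(link.url, callback=self.parse)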

Sounds like a strange bug! Why doesn't SgmlLinkExtractor work on this particular website? Is there a specific reason? – Zulubaba 2012-07-26 19:30:24


@Zulubaba: Good question. – 2012-07-27 14:42:40


The Scrapy documentation now says: "SGMLParser based link extractors are unmaintained and their use is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor." The documentation page is: http://doc.scrapy.org/en/latest/topics/link-extractors.html – gm2008 2015-04-17 17:56:23
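
Following that advice, the non-working CrawlSpider above can be migrated by swapping the extractor in the rule. A minimal sketch, assuming Scrapy 0.24 or later, where LxmlLinkExtractor is available (in Scrapy 1.0+ the same class is exposed as scrapy.linkextractors.LinkExtractor):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

class hushBabiesSpider(CrawlSpider):
    name = "hushbabies"
    allowed_domains = ["hushbabies.com"]
    start_urls = ["http://www.hushbabies.com"]

    # Same rule as before, but backed by lxml instead of sgmllib.
    rules = (
        Rule(LxmlLinkExtractor(), callback='parseItem', follow=True),
    )

    def parseItem(self, response):
        self.log("Inside parseItem: %s" % response.url)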