I can't get this script to run. Can anyone point out what exactly is wrong with it? Are all of the XPaths correct? I am unable to crawl indeed.com.
I think this part is wrong:
item['job_title'] = site.select('h2/a/@title').extract()
link_url= site.select('h2/a/@href').extract()
because the XPath is incorrect.
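One way to check the selectors in isolation (a quick sketch, assuming Scrapy's interactive shell; the URL and class names are copied from the script below) is to open the search page and try the XPaths by hand:

# Hypothetical shell session; start it from a terminal with:
#   scrapy shell "http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date"
rows = response.xpath("//div[@class='row ' or @class='row lastRow']")
print(len(rows))                               # how many result rows matched
for row in rows[:3]:
    print(row.xpath('h2/a/@title').extract())  # job title attribute
    print(row.xpath('h2/a/@href').extract())   # relative link to the job page

If these print empty lists, the XPaths are the problem; if they print data, the error is elsewhere (see the log further down).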
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from indeeda.items import IndeedaItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
import time
import sys

class MySpider(CrawlSpider):
    name = 'indeed'
    allowed_domains = ['indeed.com']
    start_urls = ['http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date?']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/jobs.q=linux&l=Chicago&sort=date$', 'q=linux&l=Chicago&sort=date&start=[0-9]+$',),
                               deny=('/my/mysearches', '/preferences', '/advanced_search', '/my/myjobs')),
             callback='parse_item', follow=True),
    )

    def parse_next_site(self, response):
        item = response.request.meta['item']
        item['source_url'] = response.url
        item['source_page_body'] = response.body
        item['crawl_timestamp'] = time.strftime('%Y-%m-%d %H:%M:%S')
        return item

    def parse_item(self, response):
        self.log('\n Crawling %s\n' % response.url)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//div[@class='row ' or @class='row lastRow']")
        items = []
        for site in sites:
            item = IndeedaItem()
            item['job_title'] = site.select('h2/a/@title').extract()
            link_url = site.select('h2/a/@href').extract()
            item['link_url'] = link_url
            item['crawl_url'] = response.url
            item['location'] = site.select("span[@class='location']/text()").extract()
            item['summary'] = site.select("//table/tr/td/span[@class='summary']").extract()
            item['source'] = site.select("table/tr/td/span[@class='source']/text()").extract()
            item['found_date'] = site.select("table/tr/td/span[@class='date']/text()").extract()
            #item['source_url'] = self.get_source(link_url)
            request = Request("http://www.indeed.com" + item['link_url'][0], callback=self.parse_next_site)
            request.meta['item'] = item
            yield request
            items.append(item)
        return

SPIDER = MySpider()
Here is the error log:
[email protected]-3542:~/indeeda$ scrapy crawl indeed
/home/hakuna/indeeda/indeeda/spiders/test.py:1: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import BaseSpider
/home/hakuna/indeeda/indeeda/spiders/test.py:3: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
from scrapy.contrib.spiders import CrawlSpider, Rule
/home/hakuna/indeeda/indeeda/spiders/test.py:5: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/home/hakuna/indeeda/indeeda/spiders/test.py:5: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/home/hakuna/indeeda/indeeda/spiders/test.py:15: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
Rule(SgmlLinkExtractor(allow=('/jobs.q=linux&l=Chicago&sort=date$','q=linux&l=Chicago&sort=date&start=[0-9]+$',),deny=('/my/mysearches', '/preferences', '/advanced_search','/my/myjobs')), callback='parse_item', follow=True),
2016-01-21 21:31:22 [scrapy] INFO: Scrapy 1.0.4 started (bot: indeeda)
2016-01-21 21:31:22 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-21 21:31:22 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'indeeda.spiders', 'SPIDER_MODULES': ['indeeda.spiders'], 'BOT_NAME': 'indeeda'}
2016-01-21 21:31:22 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-21 21:31:22 [boto] DEBUG: Retrieving credentials from metadata server.
2016-01-21 21:31:23 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-01-21 21:31:23 [boto] ERROR: Unable to read instance data, giving up
2016-01-21 21:31:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-21 21:31:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-21 21:31:23 [scrapy] INFO: Enabled item pipelines:
2016-01-21 21:31:23 [scrapy] INFO: Spider opened
2016-01-21 21:31:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-21 21:31:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-21 21:31:23 [scrapy] DEBUG: Crawled (200) <GET http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date?> (referer: None)
2016-01-21 21:31:23 [scrapy] ERROR: Spider error processing <GET http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date?> (referer: None)
Traceback (most recent call last):
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or())
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or() if _filter(r))
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or() if _filter(r))
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 73, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 52, in _requests_to_follow
    links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/linkextractors/sgml.py", line 138, in extract_links
    links = self._extract_links(body, response.url, response.encoding, base_url)
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 103, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/linkextractors/sgml.py", line 36, in _extract_links
    self.feed(response_text)
  File "/home/hakuna/anaconda/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/home/hakuna/anaconda/lib/python2.7/sgmllib.py", line 174, in goahead
    k = self.parse_declaration(i)
  File "/home/hakuna/anaconda/lib/python2.7/markupbase.py", line 98, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "/home/hakuna/anaconda/lib/python2.7/markupbase.py", line 392, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/home/hakuna/anaconda/lib/python2.7/sgmllib.py", line 111, in error
    raise SGMLParseError(message)
SGMLParseError: expected name token at "<!\\\\])/g, '\\\\$1').\n "
2016-01-21 21:31:23 [scrapy] INFO: Closing spider (finished)
2016-01-21 21:31:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 245,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 28427,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 22, 3, 31, 23, 795599),
'log_count/DEBUG': 3,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/SGMLParseError': 1,
'start_time': datetime.datetime(2016, 1, 22, 3, 31, 23, 504391)}
2016-01-21 21:31:23 [scrapy] INFO: Spider closed (finished)
The log says that most of these libraries are deprecated.
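For reference, here is a minimal sketch of the spider ported to the non-deprecated modules; it keeps only one of the original Rule's allow patterns and a few of the item fields, and the class/spider names are made up for the example:

# Sketch only, not a drop-in replacement: it uses the current import paths
# and keeps just a few of the fields. scrapy.linkextractors.LinkExtractor
# is lxml-based, so it does not go through sgmllib at all.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from indeeda.items import IndeedaItem

class IndeedSketchSpider(CrawlSpider):
    name = 'indeed_sketch'
    allowed_domains = ['indeed.com']
    start_urls = ['http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date']

    rules = (
        Rule(LinkExtractor(allow=('q=linux&l=Chicago&sort=date&start=[0-9]+$',),
                           deny=('/my/mysearches', '/preferences', '/advanced_search', '/my/myjobs')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # response.xpath() replaces HtmlXPathSelector in Scrapy 1.x
        for row in response.xpath("//div[@class='row ' or @class='row lastRow']"):
            item = IndeedaItem()
            item['job_title'] = row.xpath('h2/a/@title').extract()
            item['link_url'] = row.xpath('h2/a/@href').extract()
            item['crawl_url'] = response.url
            yield item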
You say you can't run the script, but then you talk about wrong xpaths; please show the log or error you are getting. – eLRuLL
@eLRuLL: I have added the log, thanks! – dsl1990
Why do you have 'SPIDER = MySpider()' at the end of the file? How are you running the spider? – eLRuLL