
I can't get this script to run. Can someone point out what exactly is wrong with it? Are all the XPaths correct? I am unable to crawl indeed.com.

I think this part is wrong:

item['job_title'] = site.select('h2/a/@title').extract() 
link_url= site.select('h2/a/@href').extract() 

because the XPath is incorrect.
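One quick way to check whether those XPaths match anything on the live page (a sketch only; the selectors below are simply the ones used in the spider, and Indeed's markup may have changed) is the Scrapy shell:

scrapy shell "http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date"
>>> sites = response.xpath("//div[@class='row ' or @class='row lastRow']")
>>> len(sites)                                 # 0 means the row selector matches nothing
>>> sites[0].xpath('h2/a/@title').extract()    # job title, if any rows matched
>>> sites[0].xpath('h2/a/@href').extract()     # relative job URL, if any rows matched

The full spider is below: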

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from indeeda.items import IndeedaItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
import time
import sys

class MySpider(CrawlSpider):
    name = 'indeed'
    allowed_domains = ['indeed.com']
    start_urls = ['http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date?']
    rules = (
        Rule(SgmlLinkExtractor(allow=('/jobs.q=linux&l=Chicago&sort=date$', 'q=linux&l=Chicago&sort=date&start=[0-9]+$',), deny=('/my/mysearches', '/preferences', '/advanced_search', '/my/myjobs')), callback='parse_item', follow=True),
    )

    def parse_next_site(self, response):
        item = response.request.meta['item']
        item['source_url'] = response.url
        item['source_page_body'] = response.body
        item['crawl_timestamp'] = time.strftime('%Y-%m-%d %H:%M:%S')
        return item

    def parse_item(self, response):
        self.log('\n Crawling %s\n' % response.url)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//div[@class='row ' or @class='row lastRow']")
        items = []
        for site in sites:
            item = IndeedaItem()
            item['job_title'] = site.select('h2/a/@title').extract()
            link_url = site.select('h2/a/@href').extract()
            item['link_url'] = link_url
            item['crawl_url'] = response.url
            item['location'] = site.select("span[@class='location']/text()").extract()
            tem['summary'] = site.select("//table/tr/td/span[@class='summary']").extract()
            item['source'] = site.select("table/tr/td/span[@class='source']/text()").extract()
            item['found_date'] = site.select("table/tr/td/span[@class='date']/text()").extract()
            #item['source_url'] = self.get_source(link_url)
            request = Request("http://www.indeed.com" + item['link_url'][0], callback=self.parse_next_site)
            request.meta['item'] = item
            yield request

            items.append(item)
        return

SPIDER = MySpider()

Here is the error log:

[email protected]-3542:~/indeeda$ scrapy crawl indeed 
/home/hakuna/indeeda/indeeda/spiders/test.py:1: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead 
    from scrapy.spider import BaseSpider 
/home/hakuna/indeeda/indeeda/spiders/test.py:3: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead 
    from scrapy.contrib.spiders import CrawlSpider, Rule 
/home/hakuna/indeeda/indeeda/spiders/test.py:5: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead 
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
/home/hakuna/indeeda/indeeda/spiders/test.py:5: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead 
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
/home/hakuna/indeeda/indeeda/spiders/test.py:15: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor 
    Rule(SgmlLinkExtractor(allow=('/jobs.q=linux&l=Chicago&sort=date$','q=linux&l=Chicago&sort=date&start=[0-9]+$',),deny=('/my/mysearches', '/preferences', '/advanced_search','/my/myjobs')), callback='parse_item', follow=True), 
2016-01-21 21:31:22 [scrapy] INFO: Scrapy 1.0.4 started (bot: indeeda) 
2016-01-21 21:31:22 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-01-21 21:31:22 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'indeeda.spiders', 'SPIDER_MODULES': ['indeeda.spiders'], 'BOT_NAME': 'indeeda'} 
2016-01-21 21:31:22 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-01-21 21:31:22 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-01-21 21:31:23 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/boto/utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 431, in open 
    response = self._open(req, data) 
    File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 449, in _open 
    '_open', req) 
    File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 409, in _call_chain 
    result = func(*args) 
    File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 1227, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "/home/hakuna/anaconda/lib/python2.7/urllib2.py", line 1197, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-01-21 21:31:23 [boto] ERROR: Unable to read instance data, giving up 
2016-01-21 21:31:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-01-21 21:31:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-01-21 21:31:23 [scrapy] INFO: Enabled item pipelines: 
2016-01-21 21:31:23 [scrapy] INFO: Spider opened 
2016-01-21 21:31:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-01-21 21:31:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-01-21 21:31:23 [scrapy] DEBUG: Crawled (200) <GET http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date?> (referer: None) 
2016-01-21 21:31:23 [scrapy] ERROR: Spider error processing <GET http://www.indeed.com/jobs?q=linux&l=Chicago&sort=date?> (referer: None) 
Traceback (most recent call last): 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback 
    yield next(it) 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output 
    for x in result: 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr> 
    return (_set_referer(r) for r in result or()) 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 73, in _parse_response 
    for request_or_item in self._requests_to_follow(response): 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 52, in _requests_to_follow 
    links = [l for l in rule.link_extractor.extract_links(response) if l not in seen] 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/linkextractors/sgml.py", line 138, in extract_links 
    links = self._extract_links(body, response.url, response.encoding, base_url) 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 103, in _extract_links 
    return self.link_extractor._extract_links(*args, **kwargs) 
    File "/home/hakuna/anaconda/lib/python2.7/site-packages/scrapy/linkextractors/sgml.py", line 36, in _extract_links 
    self.feed(response_text) 
    File "/home/hakuna/anaconda/lib/python2.7/sgmllib.py", line 104, in feed 
    self.goahead(0) 
    File "/home/hakuna/anaconda/lib/python2.7/sgmllib.py", line 174, in goahead 
    k = self.parse_declaration(i) 
    File "/home/hakuna/anaconda/lib/python2.7/markupbase.py", line 98, in parse_declaration 
    decltype, j = self._scan_name(j, i) 
    File "/home/hakuna/anaconda/lib/python2.7/markupbase.py", line 392, in _scan_name 
    % rawdata[declstartpos:declstartpos+20]) 
    File "/home/hakuna/anaconda/lib/python2.7/sgmllib.py", line 111, in error 
    raise SGMLParseError(message) 
SGMLParseError: expected name token at "<!\\\\])/g, '\\\\$1').\n " 
2016-01-21 21:31:23 [scrapy] INFO: Closing spider (finished) 
2016-01-21 21:31:23 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 245, 
'downloader/request_count': 1, 
'downloader/request_method_count/GET': 1, 
'downloader/response_bytes': 28427, 
'downloader/response_count': 1, 
'downloader/response_status_count/200': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 1, 22, 3, 31, 23, 795599), 
'log_count/DEBUG': 3, 
'log_count/ERROR': 3, 
'log_count/INFO': 7, 
'response_received_count': 1, 
'scheduler/dequeued': 1, 
'scheduler/dequeued/memory': 1, 
'scheduler/enqueued': 1, 
'scheduler/enqueued/memory': 1, 
'spider_exceptions/SGMLParseError': 1, 
'start_time': datetime.datetime(2016, 1, 22, 3, 31, 23, 504391)} 
2016-01-21 21:31:23 [scrapy] INFO: Spider closed (finished) 

The log reports that most of the imported modules are deprecated.
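For reference, the non-deprecated import paths that those warnings point to look roughly like this (a sketch of the Scrapy 1.0 names only, not a full rewrite of the spider):

# Scrapy 1.0+ equivalents of the deprecated imports
from scrapy.spiders import CrawlSpider, Rule       # replaces scrapy.contrib.spiders
from scrapy.linkextractors import LinkExtractor    # replaces scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor
from scrapy.http import Request                    # unchanged
# HtmlXPathSelector is also deprecated: response.xpath(...) can be used directly inside parse_item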

You say you can't run the script, but then you say the XPaths are wrong. Please show the log or errors you are getting. – eLRuLL

@eLRuLL: I have added the log, thank you! – dsl1990

Why is there 'SPIDER = MySpider()' at the end of the file? How are you running the spider? – eLRuLL

Answer

SgmlLinkExtractor is deprecated; use LinkExtractor instead. The SGMLParseError in your traceback is raised while the SGML-based extractor tries to parse the page (it chokes on the inline JavaScript); the lxml-based LinkExtractor does not have this problem.

from scrapy.linkextractors import LinkExtractor 

... 
    rules = ( 
     Rule(LinkExtractor(allow=('/jobs.q=linux&l=Chicago&sort=date$','q=linux&l=Chicago&sort=date&start=[0-9]+$',),deny=('/my/mysearches', '/preferences', '/advanced_search','/my/myjobs')), callback='parse_item', follow=True), 
    ) 
... 

I have updated it, but when I try to save the output to CSV it writes an empty CSV file. I used this command: scrapy crawl indeed -o dees.csv -t csv. I'd like to be able to follow up on this issue. Thank you very much! – dsl1990

That's a different problem. First run 'scrapy crawl indeed --logfile="mylog.log"' to check whether the spider is working, whether errors occur, whether items are returned, and so on; once everything works, then think about saving your data. Your question is mainly about your XPaths; maybe 'sites' is empty, check that. – eLRuLL

I have edited it to include the error that comes up. I haven't posted many questions here, so I didn't know where to post the error log. – dsl1990