Python Scrapy pipeline suddenly stopped working

This is very strange: I wrote a Scrapy spider with an item pipeline and have used it to crawl a lot of data, and it has always run fine. Today, when I re-ran the same code, it suddenly stopped working. The details are below.

My spider - base_url_spider.py

import re
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BaseURLSpider(CrawlSpider):
    '''
    This class is responsible for crawling Globe and Mail articles and their comments
    '''
    name = 'BaseURL'
    allowed_domains = ["www.theglobeandmail.com"]

    # seed urls
    url_path = r'../Sample_Resources/Online_Resources/sample_seed_urls.txt'
    start_urls = [line.strip() for line in open(url_path).readlines()]

    # Rules for including and excluding urls
    rules = (
        Rule(LinkExtractor(allow=r'\/opinion\/.*\/article\d+\/$'), callback="parse_articles"),
    )

    def __init__(self, **kwargs):
        '''
        :param kwargs:
        Read user arguments and initialize variables
        '''
        super(BaseURLSpider, self).__init__(**kwargs)

        self.headers = {'User-Agent': 'Mozilla/5.0',
                        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                        'X-Requested-With': 'XMLHttpRequest'}
        self.ids_seen = set()

    def parse_articles(self, response):
        article_ptn = r"http://www.theglobeandmail.com/opinion/(.*?)/article(\d+)/"
        resp_url = response.url
        article_m = re.match(article_ptn, resp_url)
        if article_m is not None:
            article_id = article_m.group(2)
            if article_id not in self.ids_seen:
                self.ids_seen.add(article_id)

                soup = BeautifulSoup(response.text, 'html.parser')
                content = soup.find('div', {"class": "column-2 gridcol"})
                if content is not None:
                    text = content.findAll('p', {"class": ''})
                    if len(text) > 0:
                        print('*****In Spider, Article ID*****', article_id)
                        print('***In Spider, Article URL***', resp_url)

                        yield {article_id: {"article_url": resp_url}}

If I run only my spider code, via the command line scrapy runspider --logfile ../logs/log.txt ScrapeNews/spiders/article_base_url_spider.py, it crawls all the URLs in start_urls.

My pipeline - base_url_pipelines.py

import json


class BaseURLPipelines(object):

    def process_item(self, item, spider):
        article_id = list(item.keys())[0]
        print("****Pipeline***", article_id)
        f_name = r'../Sample_Resources/Online_Resources/sample_base_urls.txt'
        # Append each item as one JSON object per line
        with open(f_name, 'a') as out:
            json.dump(item, out)
            out.write("\n")

        return item
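
As written, process_item appends one JSON object per line, so the output file is in JSON-lines format. A minimal sketch of reading it back, reusing the same file path purely for illustration:

import json

# Read back the JSON-lines file written by BaseURLPipelines
with open(r'../Sample_Resources/Online_Resources/sample_base_urls.txt') as f:
    for line in f:
        item = json.loads(line)  # e.g. {"543479": {"article_url": "http://..."}}
        article_id, info = next(iter(item.items()))
        print(article_id, info["article_url"])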

My settings - settings.py. These lines are not commented out:

BOT_NAME = 'ScrapeNews'
SPIDER_MODULES = ['ScrapeNews.spiders']
NEWSPIDER_MODULE = 'ScrapeNews.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'ScrapeNews.article_comment_pipelines.ArticleCommentPipeline': 400,
}
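
For reference, this is the general shape of an ITEM_PIPELINES entry; the line below registering the BaseURLPipelines class from base_url_pipelines.py is only an illustration, with the dotted path assumed from the project layout shown above:

# ScrapeNews/settings.py (illustrative; assumes base_url_pipelines.py sits inside the ScrapeNews package)
ITEM_PIPELINES = {
    'ScrapeNews.base_url_pipelines.BaseURLPipelines': 300,
}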

My scrapy.cfg - this file should point to the location of the settings file:

# Automatically created by: scrapy startproject 
# 
# For more information about the [deploy] section see: 
# https://scrapyd.readthedocs.org/en/latest/deploy.html 

[settings] 
default = ScrapeNews.settings 

[deploy] 
#url = http://localhost:6800/ 
project = ScrapeNews 

All of these pieces used to work well together.

However, today when I re-ran the code, I got this kind of log output:

2017-04-24 14:14:15 [scrapy] INFO: Enabled item pipelines: 
['ScrapeNews.article_comment_pipelines.ArticleCommentPipeline'] 
2017-04-24 14:14:15 [scrapy] INFO: Spider opened 
2017-04-24 14:14:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-04-24 14:14:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-04-24 14:14:15 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/robots.txt> (referer: None) 
2017-04-24 14:14:20 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/austerity-is-here-all-that-matters-is-the-math/article627532/> (referer: None) 
2017-04-24 14:14:24 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/ontario-can-no-longer-hide-from-taxes-restraint/article546776/> (referer: None) 
2017-04-24 14:14:24 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.theglobeandmail.com/life/life-video/video-what-was-starbucks-thinking-with-their-new-unicorn-frappuccino/article34787773/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
2017-04-24 14:14:31 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/for-palestinians-the-other-enemy-is-their-own-leadership/article15019936/> (referer: None) 
2017-04-24 14:14:32 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/would-quebecs-partitiongo-back-on-the-table/article17528694/> (referer: None) 
2017-04-24 14:14:36 [scrapy] INFO: Received SIG_UNBLOCK, shutting down gracefully. Send again to force 
2017-04-24 14:14:36 [scrapy] INFO: Closing spider (shutdown) 
2017-04-24 14:14:36 [scrapy] INFO: Received SIG_UNBLOCK twice, forcing unclean shutdown 

Compared with the abnormal log output above, if I run only my spider, the log is fine and shows something like this:

2017-04-24 14:21:20 [scrapy] DEBUG: Scraped from <200 http://www.theglobeandmail.com/opinion/were-ripe-for-a-great-disruption-in-higher-education/article543479/> 
{'543479': {'article_url': 'http://www.theglobeandmail.com/opinion/were-ripe-for-a-great-disruption-in-higher-education/article543479/'}} 
2017-04-24 14:21:20 [scrapy] DEBUG: Scraped from <200 http://www.theglobeandmail.com/opinion/saint-making-the-blessed-politics-of-canonization/article624413/> 
{'624413': {'article_url': 'http://www.theglobeandmail.com/opinion/saint-making-the-blessed-politics-of-canonization/article624413/'}} 
2017-04-24 14:21:20 [scrapy] INFO: Closing spider (finished) 
2017-04-24 14:21:20 [scrapy] INFO: Dumping Scrapy stats: 

In the abnormal log output above, I noticed something related to robots:

2017-04-24 14:14:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-04-24 14:14:15 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/robots.txt> (referer: None) 

GET http://www.theglobeandmail.com/robots.txt never appears anywhere in the normal log output. But when I open that URL in the browser, I don't quite understand what it is. So I'm not sure whether the site I'm crawling has added some robot rules?

Or does the problem come from "Received SIG_UNBLOCK, shutting down gracefully"? I haven't found any solution for that either.

The command line I used to run the code is scrapy runspider --logfile ../../Logs/log.txt base_url_spider.py

Do you know how to deal with this problem?

Answer

robots.txt is a file a website uses to tell web crawlers whether the site is allowed to be crawled. You set ROBOTSTXT_OBEY = True, which means Scrapy will obey the rules in robots.txt.

Change it to ROBOTSTXT_OBEY = False and it should work.
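
A minimal sketch of that change in the project's settings.py shown in the question:

# ScrapeNews/settings.py
# Do not fetch and obey robots.txt before crawling
ROBOTSTXT_OBEY = False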

Thank you so much for pointing this out! I changed it to False, and the robots.txt request no longer shows up, but I still get "Received SIG_UNBLOCK, shutting down gracefully. Send again to force".

What command are you using to run the spider when you get SIG_UNBLOCK?

I used 'scrapy runspider --logfile ../../Logs/log.txt base_url_spider.py'