2017-05-16 134 views

Scrapy CLOSESPIDER_ITEMCOUNT setting not working

I am running the following simple spider:

import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 2,
        'FEED_URI': 'quotes.jl',
    }

    start_urls = ['http://quotes.toscrape.com/page/{n}/'.format(n=n) for n in range(1, 4)]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

where items.py is:

import scrapy 

class QuoteItem(scrapy.Item): 
    text = scrapy.Field() 
    author = scrapy.Field() 
    tags = scrapy.Field() 

If I run scrapy crawl quotes, the log is as follows:

2017-05-16 18:07:52 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial) 
2017-05-16 18:07:52 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'} 
2017-05-16 18:07:52 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.closespider.CloseSpider', 
'scrapy.extensions.feedexport.FeedExporter', 
'scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-05-16 18:07:52 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-05-16 18:07:52 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-05-16 18:07:52 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-05-16 18:07:52 [scrapy.core.engine] INFO: Spider opened 
2017-05-16 18:07:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-05-16 18:07:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-05-16 18:07:52 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None) 
2017-05-16 18:07:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None) 
2017-05-16 18:07:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None) 
2017-05-16 18:07:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/3/> (referer: None) 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Albert Einstein', 
'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 
'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'} 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'J.K. Rowling', 
'tags': [u'abilities', u'choices'], 
'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'} 
2017-05-16 18:07:52 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount) 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Albert Einstein', 
'tags': [u'inspirational', u'life', u'live', u'miracle', u'miracles'], 
'text': u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d'} 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Jane Austen', 
'tags': [u'aliteracy', u'books', u'classic', u'humor'], 
'text': u'\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d'} 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Marilyn Monroe', 
'tags': [u'be-yourself', u'inspirational'], 
'text': u"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"} 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Albert Einstein', 
'tags': [u'adulthood', u'success', u'value'], 
'text': u'\u201cTry not to become a man of success. Rather become a man of value.\u201d'} 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Andr\xe9 Gide', 
'tags': [u'life', u'love'], 
'text': u'\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d'} 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Thomas A. Edison', 
'tags': [u'edison', u'failure', u'inspirational', u'paraphrased'], 
'text': u"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"} 
2017-05-16 18:07:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Eleanor Roosevelt', 
'tags': [u'misattributed-eleanor-roosevelt'], 
'text': u"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> 
{'author': u'Steve Martin', 
'tags': [u'humor', u'obvious', u'simile'], 
'text': u'\u201cA day without sunshine is like, you know, night.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Marilyn Monroe', 
'tags': [u'friends', 
      u'heartbreak', 
      u'inspirational', 
      u'life', 
      u'love', 
      u'sisters'], 
'text': u"\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d"} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'J.K. Rowling', 
'tags': [u'courage', u'friends'], 
'text': u'\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Albert Einstein', 
'tags': [u'simplicity', u'understand'], 
'text': u"\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d"} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Bob Marley', 
'tags': [u'love'], 
'text': u"\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d"} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Dr. Seuss', 
'tags': [u'fantasy'], 
'text': u'\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Douglas Adams', 
'tags': [u'life', u'navigation'], 
'text': u'\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Elie Wiesel', 
'tags': [u'activism', 
      u'apathy', 
      u'hate', 
      u'indifference', 
      u'inspirational', 
      u'love', 
      u'opposite', 
      u'philosophy'], 
'text': u"\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d"} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Pablo Neruda', 
'tags': [u'love', u'poetry'], 
'text': u'\u201cI love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Ralph Waldo Emerson', 
'tags': [u'happiness'], 
'text': u'\u201cFor every minute you are angry you lose sixty seconds of happiness.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Mother Teresa', 
'tags': [u'attributed-no-source'], 
'text': u'\u201cIf you judge people, you have no time to love them.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Friedrich Nietzsche', 
'tags': [u'friendship', 
      u'lack-of-friendship', 
      u'lack-of-love', 
      u'love', 
      u'marriage', 
      u'unhappy-marriage'], 
'text': u'\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Mark Twain', 
'tags': [u'books', u'contentment', u'friends', u'friendship', u'life'], 
'text': u'\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/> 
{'author': u'Allen Saunders', 
'tags': [u'fate', 
      u'life', 
      u'misattributed-john-lennon', 
      u'planning', 
      u'plans'], 
'text': u'\u201cLife is what happens to us while we are making other plans.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Garrison Keillor', 
'tags': [u'humor', u'religion'], 
'text': u'\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Jim Henson', 
'tags': [u'humor'], 
'text': u'\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Dr. Seuss', 
'tags': [u'comedy', u'life', u'yourself'], 
'text': u'\u201cToday you are You, that is truer than true. There is no one alive who is Youer than You.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Albert Einstein', 
'tags': [u'children', u'fairy-tales'], 
'text': u'\u201cIf you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'J.K. Rowling', 
'tags': [], 
'text': u'\u201cIt is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Albert Einstein', 
'tags': [u'imagination'], 
'text': u'\u201cLogic will get you from A to Z; imagination will get you everywhere.\u201d'} 
2017-05-16 18:07:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/3/> 
{'author': u'Bob Marley', 
'tags': [u'music'], 
'text': u'\u201cOne good thing about music, when it hits you, you feel no pain.\u201d'} 
2017-05-16 18:07:53 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (30 items) in: quotes.jl 
2017-05-16 18:07:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 899, 
'downloader/request_count': 4, 
'downloader/request_method_count/GET': 4, 
'downloader/response_bytes': 8287, 
'downloader/response_count': 4, 
'downloader/response_status_count/200': 3, 
'downloader/response_status_count/404': 1, 
'finish_reason': 'closespider_itemcount', 
'finish_time': datetime.datetime(2017, 5, 16, 16, 7, 53, 80572), 
'item_scraped_count': 30, 
'log_count/DEBUG': 35, 
'log_count/INFO': 8, 
'response_received_count': 4, 
'scheduler/dequeued': 3, 
'scheduler/dequeued/memory': 3, 
'scheduler/enqueued': 3, 
'scheduler/enqueued/memory': 3, 
'start_time': datetime.datetime(2017, 5, 16, 16, 7, 52, 721094)} 
2017-05-16 18:07:53 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount) 

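What puzzles me about the stats is that item_scraped_count is 30 even though I set CLOSESPIDER_ITEMCOUNT to 2. Indeed, if I look at quotes.jl, it contains 30 JSON lines. Why does the spider scrape all 30 items instead of 2?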

Answer


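What is actually happening is that CLOSESPIDER_ITEMCOUNT defines when the closing of the spider is triggered, but because of Scrapy's asynchronous nature, the spider still finishes every item that is already being processed. The log shows this: "Closing spider (closespider_itemcount)" appears right after the second item, yet all three pages had already been downloaded by then, so their remaining items are scraped as well. So if, say, 30 requests are active at once but you only want 2 items, Scrapy will scrape those 2 plus all the other items being processed in the meantime (possibly around 30-32 items in total).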
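To make the effect concrete, here is a minimal toy model of a "graceful" close. It is not Scrapy's code; run_crawl and its parameters are hypothetical names invented for this illustration, and the only assumption is that reaching the limit stops new requests while responses already received are still processed.

```python
# Toy model only: this is NOT Scrapy's implementation, just an illustration
# of a close that lets work already in flight finish.

def run_crawl(pages, concurrent_requests, item_limit):
    """Return (items_scraped, pages_downloaded) for a simplified crawl.

    pages: list with the number of items each page yields.
    concurrent_requests: how many pages are downloaded per batch.
    item_limit: a CLOSESPIDER_ITEMCOUNT-like threshold.
    """
    scraped = 0
    downloaded = 0
    closing = False
    pending = list(pages)

    while pending and not closing:
        # A whole batch of responses arrives before its items are processed.
        batch = pending[:concurrent_requests]
        pending = pending[concurrent_requests:]
        downloaded += len(batch)
        for item_count in batch:
            for _ in range(item_count):
                scraped += 1
                if scraped >= item_limit:
                    # Closing only prevents *new* batches from starting;
                    # items from responses already received still count.
                    closing = True
    return scraped, downloaded


# Three pages of 10 items each, all downloaded together, limit of 2:
print(run_crawl([10, 10, 10], concurrent_requests=3, item_limit=2))  # (30, 3)
# Same pages, but one request at a time:
print(run_crawl([10, 10, 10], concurrent_requests=1, item_limit=2))  # (10, 1)
```

The first call reproduces the numbers in the question's stats (30 items from 3 pages). The second suggests why lowering concurrency reduces the overshoot without removing it: a page that is already being parsed still yields all of its items.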

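Some possible solutions are to reduce concurrency to around 2 concurrent requests or, if you really need a fixed number of items, to drop the extra items in an item pipeline.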


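If I increase the number of start_urls (by widening the range of n), I also see that CLOSESPIDER_ITEMCOUNT "works" in the sense that it limits the number of items scraped, but the total number of scraped items is indeed not that number.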