
How do I get a Scrapy pipeline to fill my MongoDB with my items? Here is what my code looks like right now, which reflects what I got from the Scrapy documentation. I should also mention that I have tried returning the items instead of yielding them, as well as using item loaders; every approach seems to give the same result. On that note, if I run the command

    mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json

my database does get populated (as long as I yield the items rather than return them)... I would really like to get this pipeline working, though.

OK, so here is my code.

Here is my spider:

    import scrapy

    from scrapy.selector import Selector
    from scrapy.loader import ItemLoader
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.http import HtmlResponse
    from capstone.items import CapstoneItem

    class CongressSpider(CrawlSpider):
        name = "congress"
        allowed_domains = ["www.congress.gov"]
        start_urls = [
            'https://www.congress.gov/members',
        ]
        # creating a rule for my crawler. I only want it to continue to the next page, don't follow any other links.
        rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)), callback="parse_page", follow=True),)

        def parse_page(self, response):
            for search in response.selector.xpath(".//li[@class='compact']"):
                yield {
                    'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                    'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                    'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                    'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                    'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
                }
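As an aside, not part of the original question: since items.py (shown further down) already declares CapstoneItem, the same loop could yield that item type instead of a plain dict. Scrapy passes both through pipelines the same way, so this is only a sketch of the more explicit style; note the keys have to match the field names declared in items.py (for example served, lowercase, not Served):

        def parse_page(self, response):
            for search in response.selector.xpath(".//li[@class='compact']"):
                item = CapstoneItem()
                item['member'] = ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip()
                item['state'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip()
                # ...fill in District, party and served the same way, using the names declared in items.py
                yield item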

Settings:

    BOT_NAME = 'capstone'

    SPIDER_MODULES = ['capstone.spiders']
    NEWSPIDER_MODULE = 'capstone.spiders'

    ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DATABASE = 'congress'
    ROBOTSTXT_OBEY = True
    DOWNLOAD_DELAY = 10

Here is my pipeline.py:

    import pymongo

    from pymongo import MongoClient
    from scrapy.conf import settings
    from scrapy.exceptions import DropItem
    from scrapy import log

    class MongoDBPipeline(object):
        collection_name = 'members'

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI')
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db[self.collection_name].insert(dict(item))
            return item
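As a quick sanity check while debugging (not part of the original post, just a sketch): logging from process_item makes it obvious whether Scrapy ever calls the pipeline at all. If the line below never shows up in the crawl log, the pipeline was never enabled in the first place.

        def process_item(self, item, spider):
            # temporary debug logging: remove once the pipeline is confirmed to run
            spider.logger.info("MongoDBPipeline received: %r", dict(item))
            self.db[self.collection_name].insert(dict(item))
            return item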

Here is items.py:

    import scrapy

    class CapstoneItem(scrapy.Item):
        member = scrapy.Field()
        state = scrapy.Field()
        District = scrapy.Field()
        party = scrapy.Field()
        served = scrapy.Field()

Last but not least, here is my output:

    2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 8007,
     'downloader/request_count': 24,
     'downloader/request_method_count/GET': 24,
     'downloader/response_bytes': 757157,
     'downloader/response_count': 24,
     'downloader/response_status_count/200': 24,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
     'item_scraped_count': 2139,
     'log_count/DEBUG': 2164,
     'log_count/INFO': 11,
     'request_depth_max': 22,
     'response_received_count': 24,
     'scheduler/dequeued': 23,
     'scheduler/dequeued/memory': 23,
     'scheduler/enqueued': 23,
     'scheduler/enqueued/memory': 23,
     'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
    2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)

So as far as I can tell, I am not getting any errors and my items are being scraped. If I run the crawl with -o myfile.json, I can import that file into MongoDB, but the pipeline just isn't doing anything!

    mongo
    MongoDB shell version: 3.2.12
    connecting to: test
    Server has startup warnings:
    2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
    2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
    2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] **  We suggest setting it to 'never'
    2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
    2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
    2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] **  We suggest setting it to 'never'
    2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
    > show dbs
    congress 0.078GB
    local  0.078GB
    > use congress
    switched to db congress
    > show collections
    members
    system.indexes
    > db.members.count()
    0
    >

I suspect my problem has to do with my settings file. I am new to both Scrapy and MongoDB, and I have a feeling I am not pointing Scrapy at my MongoDB correctly. Here are some other resources I found and tried to follow as examples, but everything I tried led to the same result (the scrape finishes, Mongo stays empty): https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/ and https://github.com/sebdah/scrapy-mongodb. I have a bunch more sources, but unfortunately not enough reputation to post more links. Anyway, any ideas would be appreciated. Thanks.
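One way to narrow this down (an added sketch, not from the original post; it assumes the same localhost URI and congress database as in the settings above, and the collection name pipeline_smoke_test is just a throwaway) is to test the MongoDB connection with pymongo outside of Scrapy. If this round-trip works, the connection details are fine and the problem is on the Scrapy side, e.g. the pipeline never being enabled:

    import pymongo

    client = pymongo.MongoClient('mongodb://localhost:27017')
    db = client['congress']
    # write one throwaway document and read it back to prove the connection settings work
    db['pipeline_smoke_test'].insert_one({'member': 'connectivity check'})
    print(db['pipeline_smoke_test'].find_one())
    client.close()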

+0

You have a typo in your MongoDBPipeline: def open_sipder(self, spider): should be open_spider – Granitosaurus

+0

Oops... that didn't fix the problem, but thanks! –

+0

Also, in MongoDBPipeline's from_crawler(cls, crawler), the two arguments in the return cls(...) statement should be separated by a comma. Whether or not that turns out to be the last step, I suggest http://stackoverflow.com/questions/299704/what-are-good-ways-to-make-my-python-code-run-first-time and http://stackoverflow.com/questions/1623039/python-debugging-tips for some tips on basic testing/debugging when writing Python scripts. – thanasisp

Answers

1

I commented out the line of my code that said

ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}

and I uncommented the line that was already inside the settings file:

ITEM_PIPLINES = { 'capstone.pipelines.MongoDBPipeline': 300, }

The only difference I can see is the line breaks, and that this setting sits much further down than my other settings. After getting this working, I started getting Python errors about the typos in my pipeline file. I figured out that my items were being scraped but my pipeline was never being connected, because the output showed:

[scrapy.middleware] INFO: Enabled item pipelines:[]

After changing my settings, I got this:

[scrapy.middleware] INFO: Enabled item pipelines:['capstone.pipelines.MongoDBPipeline']
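The likely explanation for why this change mattered: the commented-out line spelled the setting ITEM_PIPLINES, while the name Scrapy actually reads is ITEM_PIPELINES. A misspelled setting in settings.py is just an unused variable, which is why the log showed an empty "Enabled item pipelines" list before. A sketch of the relevant block as it should end up (same values as in the question, only the setting name corrected):

    ITEM_PIPELINES = {
        'capstone.pipelines.MongoDBPipeline': 300,
    }
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DATABASE = 'congress'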

0

There is a typo where you set the DB name:

mongo_db=crawer.settings.get('MONGO_DATABASE', 'items') 

should be

mongo_db=crawler.settings.get('MONGO_DATABASE', 'items') 

Hope that works!
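Combining this with the missing comma that thanasisp pointed out in the comments on the question, the whole from_crawler method would then read (a sketch of both fixes together):

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
            )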

+0

Thanks! Still nothing in my database :/ –
