How do I get a Scrapy pipeline to fill my MongoDB with my items? Here is what my code looks like right now, which reflects what I learned from the Scrapy documentation. I also want to mention that I have tried returning the items instead of yielding them, as well as using item loaders; all of these approaches seem to give the same result. On that note, I'd like to mention that if I run the command mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json
my database does get populated (as long as I yield and don't return the items)... I would really like to get this pipeline working, though. How do I get a Scrapy pipeline to fill my MongoDB with my items?
Okay, so here is my code.
Here is my spider:
import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse
from capstone.items import CapstoneItem
class CongressSpider(CrawlSpider):
    name = "congress"
    allowed_domains = ["www.congress.gov"]
    start_urls = [
        'https://www.congress.gov/members',
    ]
    # Creating a rule for my crawler. I only want it to continue to the next page, not follow any other links.
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)), callback="parse_page", follow=True),)

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            yield {
                'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
            }
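(An aside on the code above: items.py further down declares CapstoneItem, but the spider yields plain dicts, so the pipeline never receives the declared item. A minimal sketch of yielding the item instead, keeping the same XPaths but shortened here for brevity, might look like the following; note that a scrapy.Item raises KeyError for any key not declared as a Field, and that items.py declares 'served' while the dict above uses 'Served':)

# Hypothetical variant of parse_page that yields the declared item.
from capstone.items import CapstoneItem

def parse_page(self, response):
    for search in response.selector.xpath(".//li[@class='compact']"):
        item = CapstoneItem()
        item['member'] = search.xpath("normalize-space(span/a/text())").extract_first(default='').strip()
        # ... remaining fields assigned the same way, using the Field names from items.py ...
        yield item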
Here are my settings:
BOT_NAME = 'capstone'
SPIDER_MODULES = ['capstone.spiders']
NEWSPIDER_MODULE = 'capstone.spiders'
ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'congress'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 10
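(A quick way to verify the settings file is being picked up as expected is to print the pipeline setting Scrapy resolves, as in this sketch run from the project directory; Scrapy also reports "Enabled item pipelines:" in its startup log, which is worth checking:)

# Sanity check: print the item pipeline mapping Scrapy actually resolves.
# An empty dict here means no item pipeline is enabled.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.getdict('ITEM_PIPELINES'))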
Here is my pipeline.py:
import pymongo
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
class MongoDBPipeline(object):
    collection_name = 'members'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI')
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
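(To rule out the MongoDB side, a standalone pymongo script, independent of Scrapy, can confirm the URI and database name work; here is a minimal sketch using the same localhost values as the settings above. As a side note, pymongo 3.x deprecates collection.insert() in favor of insert_one():)

# Standalone insert into the same database/collection the pipeline targets.
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
db = client['congress']
db['members'].insert_one({'member': 'test entry'})
print(db['members'].find_one({'member': 'test entry'}))  # should print the test document
client.close()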
Here is my items.py:
import scrapy
class CapstoneItem(scrapy.Item):
    member = scrapy.Field()
    state = scrapy.Field()
    District = scrapy.Field()
    party = scrapy.Field()
    served = scrapy.Field()
Last but not least, my output looks like this:
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8007,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'downloader/response_bytes': 757157,
'downloader/response_count': 24,
'downloader/response_status_count/200': 24,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
'item_scraped_count': 2139,
'log_count/DEBUG': 2164,
'log_count/INFO': 11,
'request_depth_max': 22,
'response_received_count': 24,
'scheduler/dequeued': 23,
'scheduler/dequeued/memory': 23,
'scheduler/enqueued': 23,
'scheduler/enqueued/memory': 23,
'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)
So it looks to me like I'm not getting any errors and my items are being scraped. If I run it with -o myfile.json I can import that file into my MongoDB, but the pipeline just isn't doing anything!
mongo
MongoDB shell version: 3.2.12
connecting to: test
Server has startup warnings:
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
> show dbs
congress 0.078GB
local 0.078GB
> use congress
switched to db congress
> show collections
members
system.indexes
> db.members.count()
0
>
I suspect my problem has something to do with my settings file. I'm new to Scrapy and MongoDB, and I have a feeling I'm not pointing Scrapy at my MongoDB correctly. Here are some other pipelines I found and tried to use as examples, but everything I tried led to the same result (the scrape finishes, Mongo stays empty): https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/ https://github.com/sebdah/scrapy-mongodb I have a whole bunch more sources, but not enough reputation to post more, unfortunately. Anyway, any ideas would be appreciated, thanks.
You have a typo in your MongoDBPipeline: 'def open_sipder(self, spider):' should be 'open_spider'. – Granitosaurus
Oops... that didn't fix the problem, but thanks! –
Also, in MongoDBPipeline's from_crawler(cls, crawler), the two arguments in the 'return cls()' statement should be separated by a comma. Whether or not that turns out to be the last step, I recommend http://stackoverflow.com/questions/299704/what-are-good-ways-to-make-my-python-code-run-first-time and http://stackoverflow.com/questions/1623039/python-debugging-tips for some tips on basic testing/debugging when writing Python scripts. – thanasisp
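(For reference, a sketch of the from_crawler method with the comma fix this comment describes:)

# from_crawler with the missing comma added between the keyword arguments.
@classmethod
def from_crawler(cls, crawler):
    return cls(
        mongo_uri=crawler.settings.get('MONGO_URI'),
        mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
    )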