I am new to Heroku PG. What I have done is write a Scrapy crawler, which runs without any errors. The problem is that I want to put all the scraped data into my Heroku Postgres database. To do that, I loosely followed this tutorial for loading data scraped by a Scrapy spider into a Heroku PG database.
When I run the crawler on my local machine with scrapy crawl spidername, it completes successfully, but no scraped data is inserted and no table is created in the Heroku database. I don't even get any error in my local terminal. Here is my code:
settings.py
BOT_NAME = 'crawlerconnectdatabase'

SPIDER_MODULES = ['crawlerconnectdatabase.spiders']
NEWSPIDER_MODULE = 'crawlerconnectdatabase.spiders'

DATABASE = {
    'drivername': 'postgres',
    'host': 'ec2-54-235-250-41.compute-1.amazonaws.com',
    'port': '5432',
    'username': 'dtxwjcycsaweyu',
    'password': '***',
    'database': 'ddcir2p1u2vk07',
}
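One thing worth double-checking: Scrapy only calls a pipeline that is registered in the ITEM_PIPELINES setting, and no such entry appears in the settings.py above. A sketch of what that registration could look like, where the dotted path is an assumption based on the project name shown here:

```python
# settings.py (continued) -- register the pipeline so Scrapy actually runs it.
# The dotted path below is an assumption based on the project layout above.
ITEM_PIPELINES = {
    'crawlerconnectdatabase.pipelines.CrawlerconnectdatabasePipeline': 300,
}
```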
items.py
from scrapy.item import Item, Field


class CrawlerconnectdatabaseItem(Item):
    name = Field()
    url = Field()
    title = Field()
    link = Field()
    page_title = Field()
    desc_link = Field()
    body = Field()
models.py
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings

DeclarativeBase = declarative_base()


def db_connect():
    """Connect to the database described by settings.DATABASE; return an engine."""
    return create_engine(URL(**settings.DATABASE))


def create_deals_table(engine):
    """Create the tables for all declarative models."""
    DeclarativeBase.metadata.create_all(engine)


class Deals(DeclarativeBase):
    """SQLAlchemy deals model."""
    __tablename__ = "news_data"

    id = Column(Integer, primary_key=True)
    body = Column('body', String)
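Assuming SQLAlchemy is installed, the table-creation half of this can be sanity-checked locally by swapping the Heroku Postgres engine for an in-memory SQLite one. A sketch, with the model restated inline so it runs standalone:

```python
from sqlalchemy import create_engine, Column, Integer, String, inspect
from sqlalchemy.orm import declarative_base  # sqlalchemy.ext.declarative on older versions

Base = declarative_base()

class Deals(Base):
    __tablename__ = "news_data"
    id = Column(Integer, primary_key=True)
    body = Column(String)

# An in-memory SQLite engine stands in for Heroku Postgres here
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
print(inspect(engine).get_table_names())  # ['news_data']
```

If the table list comes back empty here too, the problem is in the models; if it prints the table name, the models are fine and the problem is elsewhere (connection or pipeline wiring).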
pipelines.py
from sqlalchemy.orm import sessionmaker

from models import Deals, db_connect, create_deals_table


class CrawlerconnectdatabasePipeline(object):
    def __init__(self):
        """Connect to the database and create the table if it doesn't exist."""
        engine = db_connect()
        create_deals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save each scraped item in the database."""
        session = self.Session()
        deal = Deals(**item)
        try:
            session.add(deal)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
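In the same spirit, the process_item session/commit flow can be exercised end to end against SQLite before pointing it at Heroku. A sketch in which the model is restated inline and a plain dict stands in for the scraped item:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Deals(Base):
    __tablename__ = "news_data"
    id = Column(Integer, primary_key=True)
    body = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

# A plain dict stands in for the scraped item; note that Deals(**item)
# only accepts keys that match the model's columns.
item = {'body': 'some scraped text'}

session = Session()
try:
    session.add(Deals(**item))
    session.commit()
except:
    session.rollback()
    raise
finally:
    session.close()

# Read the row back to confirm the commit landed
session = Session()
stored = session.query(Deals).count()
session.close()
print(stored)  # 1
```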
spider
The code for the Scrapy spider can be found here.