Scrapy throwing URL error
I want to build a web crawler that scrapes data from Flipkart, and I am using MongoDB to store the data. My code is as follows:
WebSpider.py
from scrapy.spider import CrawlSpider
from scrapy.selector import Selector
from spider_web.items import SpiderWebItem


class WebSpider(CrawlSpider):
    name = "spider_web"
    allowed_domains = ["http://www.flipkart.com"]
    start_urls = [
        "http://www.flipkart.com/search?q=amish+tripathi",
    ]

    def parse(self, response):
        books = response.selector.xpath(
            '//div[@class="old-grid"]/div[@class="gd-row browse-grid-row"]')
        for book in books:
            item = SpiderWebItem()
            item['title'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-title fk-font-13"]/a[contains(@href, "from-search")]/@title').extract()[0].strip()
            item['rating'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0]
            item['noOfRatings'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/text()').extract()[1].strip()
            item['url'] = response.url
            yield item
items.py
from scrapy.item import Item, Field


class SpiderWebItem(Item):
    url = Field()
    title = Field()
    rating = Field()
    noOfRatings = Field()
pipelines.py
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        for data in item:
            if not data:
                raise DropItem("Missing data!")
        self.collection.update({'title': item['title']}, dict(item), upsert=True)
        log.msg("book added to MongoDB database!",
                level=log.DEBUG, spider=spider)
        return item
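For reference, the collection can be inspected directly with pymongo to see what actually gets stored (a minimal standalone sketch, reusing the same connection values as the MONGODB_* settings in settings.py below):

import pymongo

# Connect with the same host/port/db/collection the pipeline uses
client = pymongo.MongoClient("localhost", 27017)
collection = client["flipkart"]["books"]

# Print whatever the spider managed to store before it stopped
for doc in collection.find({}, {"_id": 0}):
    print(doc)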
settings.py
BOT_NAME = 'spider_web'

SPIDER_MODULES = ['spider_web.spiders']
NEWSPIDER_MODULE = 'spider_web.spiders'

DOWNLOAD_HANDLERS = {
    's3': None,
}

DOWNLOAD_DELAY = 0.25
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

ITEM_PIPELINES = ['spider_web.pipelines.MongoDBPipeline', ]

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "flipkart"
MONGODB_COLLECTION = "books"
I have checked every XPath in the Scrapy shell and they all return the correct results, but the start URL throws an error when crawled. The error I get when running the spider is:
2015-10-05 20:05:10 [scrapy] ERROR: Spider error processing <GET http://www.flipkart.com/search?q=rabindranath+tagore> (referer: None)
........
  File "F:\myP\Web Scraping\spider_web\spider_web\spiders\WebSpider.py", line 21, in parse
    './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0]
IndexError: list index out of range
I am at my wits' end here. The spider extracts data for one or two items, then raises this error and stops altogether. Any help would be appreciated. Thanks in advance.
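For what it's worth, the failing line can be reproduced on its own in the Scrapy shell (a minimal sketch against the same search URL from start_urls; my unverified guess is that some results simply have no pu-rating block, so extract() returns an empty list):

# Run inside: scrapy shell "http://www.flipkart.com/search?q=amish+tripathi"
books = response.xpath('//div[@class="old-grid"]/div[@class="gd-row browse-grid-row"]')
for i, book in enumerate(books):
    # extract() returns a list of matches; indexing [0] on an empty list
    # raises exactly the IndexError shown in the traceback above
    rating = book.xpath(
        './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()
    print(i, len(rating))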