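Scrapy - How to avoid grouping collected information into a single item
I'm having a problem with the data Scrapy collects. When I run this code from the terminal, everything it collects is appended to a single item, which looks like this: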
{"fax": ["Fax: 617-638-4905", "Fax: 925-969-1795", "Fax: 913-327-1491", "Fax: 507-281-0291", "Fax: 509-547-1265", "Fax: 310-437-0585"],
"title": ["Challenges in Musculoskeletal Rehabilitation", "17th Annual Spring Conference on Pediatric Emergencies", "19th Annual Association of Professors of Human & Medical Genetics (APHMG) Workshop & Special Interest Groups Meetings", "2013 AMSSM 22nd Annual Meeting", "61st Annual Meeting of Pacific Coast Reproductive Society (PCRS)", "Contraceptive Technology Conference 25th Anniversary", "Mid-America Orthopaedic Association 2013 Meeting", "Pain Management", "Peripheral Vascular Access Ultrasound", "SAGES 2013/ISLCRS 8th International Congress"], ... ...
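... and so on.
The problem is that all the information for each field ends up in one item. I need the information to come out as separate items. In other words, I need each title to be associated with one fax number (if present), one location, and so on.
I don't want all the information lumped together, because each piece of collected information is related to the others. Ultimately, I'd like it to go into the database like this:
"MedEconItem" 1: [title: "insert title 1 here", fax: "insert fax #1 here", location: "location 1" ...]
"MedEconItem" 2: [title: "title 2", fax: "fax #2", location: "location 2" ...]
"MedEconItem" 3: [... and so on
Any ideas on how to solve this? Does anyone know how to separate this information easily? This is my first time using Scrapy, so any advice is welcome. I've looked everywhere and can't seem to find an answer.
Here is my current code: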
import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class MedEconItem(Item):
    title = Field()
    date = Field()
    location = Field()
    specialty = Field()
    contact = Field()
    phone = Field()
    fax = Field()
    email = Field()
    url = Field()

class autoupdate(BaseSpider):
    name = "medecon"
    allowed_domains = ["www.doctorsreview.com"]
    start_urls = [
        "http://www.doctorsreview.com/meetings/search/?region=united-states&destination=all&specialty=all&start=YYYY-MM-DD&end=YYYY-MM-DD",
    ]

    def serialize_field(self, field, name, value):
        if field == '':
            return super(MedEconItem, self).serialize_field(field, name, value)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html/body/div[@id="c"]/div[@id="meeting_results"]')
        items = []
        for site in sites:
            item = MedEconItem()
            item['title'] = site.select('//h3/a/text()').extract()
            item['date'] = site.select('//p[@class = "dls"]/span[@class = "date"]/text()').extract()
            item['location'] = site.select('//p[@class = "dls"]/span[@class = "location"]/a/text()').extract()
            item['specialty'] = site.select('//p[@class = "dls"]/span[@class = "specialties"]/text()').extract()
            item['contact'] = site.select('//p[@class = "contact"]/text()').extract()
            item['phone'] = site.select('//p[@class = "phone"]/text()').extract()
            item['fax'] = site.select('//p[@class = "fax"]/text()').extract()
            item['email'] = site.select('//p[@class = "email"]/text()').extract()
            item['url'] = site.select('//p[@class = "website"]/a/@href').extract()
            items.append(item)
        return item
I tried this code, but it raises a NotImplementedError. The log says the site was crawled, but for the GET request it reports an error: "ERROR: Spider error processing".
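One possible cause of a NotImplementedError, offered as a guess rather than a diagnosis: as far as I know, the default `parse` on Scrapy's `BaseSpider` simply raises `NotImplementedError`, so the error appears whenever the spider class does not actually override `parse`, for example when an indentation slip nests `parse` inside another method. The sketch below reproduces that effect with plain Python. `BaseSpiderStandIn`, `MisindentedSpider`, `FixedSpider`, and `try_parse` are hypothetical names for illustration, not Scrapy classes.

```python
class BaseSpiderStandIn(object):
    """Hypothetical stand-in for Scrapy's BaseSpider (not the real class)."""

    def parse(self, response):
        # Assumed to mirror the default behaviour: subclasses must override.
        raise NotImplementedError


class MisindentedSpider(BaseSpiderStandIn):
    name = "medecon"

    def serialize_field(self, field, name, value):
        # parse is accidentally nested here, so the class itself
        # never overrides parse and the base version is still used.
        def parse(self, response):
            return []
        return value


class FixedSpider(BaseSpiderStandIn):
    name = "medecon"

    def parse(self, response):
        # Defined at class level, so it replaces the base implementation.
        return []


def try_parse(spider):
    """Call spider.parse and report whether the base version was hit."""
    try:
        spider.parse(None)
        return "ok"
    except NotImplementedError:
        return "NotImplementedError"


if __name__ == "__main__":
    print(try_parse(MisindentedSpider()))
    print(try_parse(FixedSpider()))
```

If this is what is happening, checking that `def parse` is indented exactly one level inside the spider class would be the first thing to verify.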
knn360
2013-05-01 20:01:04
That's strange. Which version of Scrapy are you using? – Talvalin 2013-05-02 07:23:38
I'm using Scrapy 0.16.4 – knn360 2013-05-03 02:22:33
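On the grouping problem itself, a likely explanation: the outer XPath selects the single `meeting_results` container, so the loop runs once, and every inner XPath starts with `//`, which searches the whole document rather than the current node. Each field therefore receives every match on the page. The usual approach is to loop over one node per meeting and use relative paths (starting with `.//`) inside the loop, building one item per iteration. The sketch below shows the idea with the standard library's ElementTree as a stand-in for Scrapy selectors. The per-meeting wrapper `div class="meeting"`, the sample markup, and the helper names are assumptions, since the real page structure is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real page may wrap each meeting differently.
SAMPLE = """
<div id="meeting_results">
  <div class="meeting">
    <h3><a>Title 1</a></h3>
    <p class="fax">Fax #1</p>
  </div>
  <div class="meeting">
    <h3><a>Title 2</a></h3>
  </div>
</div>
"""


def first_text(node, path):
    """Return the stripped text of the first match under node, or None."""
    found = node.find(path)
    if found is None or not found.text:
        return None
    return found.text.strip()


def extract_meetings(markup):
    """Build one record per meeting node instead of one per page."""
    root = ET.fromstring(markup)
    records = []
    # Loop over each meeting, not over the single results container.
    for meeting in root.findall(".//div[@class='meeting']"):
        records.append({
            # Paths start with ".//" so they only search inside this meeting.
            "title": first_text(meeting, ".//h3/a"),
            "fax": first_text(meeting, ".//p[@class='fax']"),
        })
    return records


meetings = extract_meetings(SAMPLE)

if __name__ == "__main__":
    for record in meetings:
        print(record)
```

In the spider, the equivalent change would be to select the per-meeting nodes in `sites`, prefix each inner XPath with a dot (for example `.//h3/a/text()`), and return `items` rather than `item`. A meeting without a fax then gets an empty value instead of borrowing one from another meeting.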