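When Scrapy iterates over a list of hrefs found on a page, why does it start displaying scraped items from somewhere in the middle of the list instead of the first href? Why aren't the scraped items displayed in order?
I am extracting state library information from the list of links found on this page: http://www.publiclibraries.com/.
The XPath I am using is: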
//div/div/div/table/tr/td/a/@href
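The code appears to work fine, but when the scraped items are displayed, Scrapy seems to start with Kentucky, Louisiana, Mississippi, or Missouri, and I don't know why. Which one it shows first is not consistent between runs, but it eventually displays all the states (just not in the order they appear on the page).
Why doesn't it start with Alabama? Is this related to threading? If so, is there a way to force Scrapy to display the states in the order in which they appear on the initial page?
Spider code: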
import scrapy
import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured
from tutorial.items import LibAddressItem


class DmozSpider(scrapy.Spider):
    name = "us-pub-lib-physical_addresses"
    allowed_domains = ["publiclibraries.com"]
    start_urls = [
        "http://www.publiclibraries.com/"
    ]

    def parse(self, response):
        print("#################################################################")
        print(response.url)
        print("Top level states list")
        print("#################################################################")
        # One request per state link; the responses come back asynchronously.
        for href in response.xpath("//div/div/div/table/tr/td/a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_state_libs)

    # Running total of library rows seen across all state pages.
    count = 0

    def parse_state_libs(self, response):
        print("#################################################################")
        print(response.url)
        print("#################################################################")
        # One item per table row (the header row yields an empty item).
        for sel in response.xpath('//div/div/div/table/tr'):
            item = LibAddressItem()
            item['city'] = sel.xpath('td[1]/text()').extract()
            item['library'] = sel.xpath('td[2]/text()').extract()
            item['address'] = sel.xpath('td[3]/text()').extract()
            item['zip_code'] = sel.xpath('td[4]/text()').extract()
            item['phone'] = sel.xpath('td[5]/text()').extract()
            self.count += 1
            yield item
        print("#####################################")
        print("The number of libraries found so far:")
        print(self.count)
        print("#####################################")
LibAddressItem:
import scrapy


class LibAddressItem(scrapy.Item):
    city = scrapy.Field()
    state = scrapy.Field()
    library = scrapy.Field()
    address = scrapy.Field()
    zip_code = scrapy.Field()
    phone = scrapy.Field()
The first items displayed:
2015-11-19 13:59:57 [scrapy] DEBUG: Crawled (200) <GET http://www.publiclibraries.com/> (referer: None)
#################################################################
http://www.publiclibraries.com/
Top level states list
#################################################################
2015-11-19 13:59:58 [scrapy] DEBUG: Crawled (200) <GET http://www.publiclibraries.com/kentucky.htm> (referer: http://www.publiclibraries.com/)
#################################################################
http://www.publiclibraries.com/kentucky.htm
#################################################################
2015-11-19 13:59:58 [scrapy] DEBUG: Scraped from <200 http://www.publiclibraries.com/kentucky.htm>
{'address': [], 'city': [], 'library': [], 'phone': [], 'zip_code': []}
2015-11-19 13:59:58 [scrapy] DEBUG: Scraped from <200 http://www.publiclibraries.com/kentucky.htm>
{'address': [u'302 King Drive'],
'city': [u'Albany'],
'library': [u'Clinton County Public Library'],
'phone': [u'(606) 387-5989'],
'zip_code': [u'42602']}
2015-11-19 13:59:58 [scrapy] DEBUG: Scraped from <200 http://www.publiclibraries.com/kentucky.htm>
{'address': [u'1740 Central Avenue'],
'city': [u'Ashland'],
'library': [u'Boyd County Public Library'],
'phone': [u'(606) 329-0090'],
'zip_code': [u'41101']}
2015-11-19 13:59:58 [scrapy] DEBUG: Scraped from <200 http://www.publiclibraries.com/kentucky.htm>
{'address': [u'1016 Summit Road'],
'city': [u'Ashland'],
'library': [u'Summit Branch'],
'phone': [u'(606) 928-3366'],
'zip_code': []}
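This is exactly what I was looking for. I'm obviously new to Scrapy; I think my next step is to really understand meta, since it is part of the solution I want. Thanks. – ryan71
Glad it helped. 'meta' is just the way to communicate between callbacks. – eLRuLL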
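For readers following up on the 'meta' comments, here is a minimal sketch of why the first state shown varies between runs and how a position carried alongside each request lets you restore page order. It is a plain-Python simulation, not Scrapy code: `simulate_crawl`, `restore_order`, and the `position` key are hypothetical names, the shuffle only stands in for network timing, and every file name except kentucky.htm is assumed for illustration. The underlying point is that Scrapy issues requests concurrently and runs callbacks as responses arrive.

```python
import random


def simulate_crawl(urls, seed=None):
    """Return simulated 'responses' in an arbitrary arrival order.

    Each request records its position on the start page in a meta dict,
    mirroring the role that Request.meta plays in Scrapy.
    """
    requests = [{"url": url, "meta": {"position": i}}
                for i, url in enumerate(urls)]
    random.Random(seed).shuffle(requests)  # stand-in for network timing
    # The "callback" copies the position from meta onto the scraped item.
    return [{"url": r["url"], "position": r["meta"]["position"]}
            for r in requests]


def restore_order(items):
    """Sort scraped items back into start-page order."""
    return sorted(items, key=lambda item: item["position"])


if __name__ == "__main__":
    # Illustrative file names; only kentucky.htm appears in the log above.
    states = ["alabama.htm", "alaska.htm", "kentucky.htm", "louisiana.htm"]
    arrived = simulate_crawl(states)
    print([item["url"] for item in arrived])                 # arrival order
    print([item["url"] for item in restore_order(arrived)])  # page order
```

In a real spider the same idea would presumably look like `scrapy.Request(url, callback=self.parse_state_libs, meta={'position': i})`, with `response.meta['position']` read inside the callback. Sorting would then have to happen after the crawl (for example on the exported items), since items are emitted as each page finishes.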