2017-09-26 57 views

Scrapy - crawling 4 levels of pages per item, cannot go depth-first

I want to scrape the township directory of China. The site is organized into four levels: province pages, city pages, county pages, and township pages. For example, the province page lists all provinces; clicking a province's link leads to that province's city page, which lists its cities.

I want each of my items to be a township. An item contains town_name, town_id (gbcode), and the corresponding county_name, city_name, and prov_name, so the spider should collect all of that information once it reaches a township page. However, my current for-loop approach does not seem to work. prov_name comes out fine, but the city and county names are mostly wrong: they are always the last city/county in the list on their corresponding page. I think the problem is that the spider does not go deep enough, only issuing the parse_county requests at the end of the loop. However, changing the depth priority in the settings does not solve the problem.

---------- Sample Result -------- 
town_name, year, gbcode, city, province, county 
建國門街道辦事處,2016,110101008000,市轄區,北京市,延慶區 
東直門街道辦事處,2016,110101009000,市轄區,北京市,延慶區 
和平里街道辦事處,2016,110101010000,市轄區,北京市,延慶區 
前門街道辦事處,2016,110101011000,市轄區,北京市,延慶區 
崇文門外街道辦事處,2016,110101012000,市轄區,北京市,延慶區 



import scrapy
import re
from scrapy.spiders import Spider
from admincode.items import AdmincodeItem


class StatsSpider(Spider):
    name = 'stats'
    allowed_domains = ['stats.gov.cn']
    start_urls = [
        'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/{}/index.html'.format(year)
        for year in range(2009, 2010)]

    def parse(self, response):
        for item in self.parse_provincetr(response, response.selector.css(".provincetr")):
            yield item

    def get_text_href(self, td):
        if not td.xpath('a'):
            return td.xpath('text()').extract()[0], None
        else:
            return td.xpath('a/text()').extract()[0], td.xpath('a/@href').extract()[0]

    def parse_provincetr(self, response, trs):
        year_pattern = re.compile('(tjyqhdmhcxhfdm/)([0-9][0-9][0-9][0-9])')
        year = year_pattern.search(response.url).group(2)
        for td in trs.xpath('td'):
            scraped = {}
            scraped['year'] = year
            scraped['prov_name'], href = self.get_text_href(td)
            url = response.urljoin(href)
            yield scrapy.Request(url, callback=self.parse_citytr,
                                 meta={'scraped': scraped})

    def parse_2td(self, response, trs, var_name, nextparse):
        for tr in trs:
            scraped = response.meta['scraped']
            scraped[var_name], href = self.get_text_href(tr.xpath('td')[1])
            if nextparse:
                url = response.urljoin(href)
                yield scrapy.Request(url, callback=nextparse, meta={'scraped': scraped})
            else:
                item = AdmincodeItem()
                item['year'] = scraped['year']
                item['prov_name'] = scraped['prov_name']
                item['city_name'] = scraped['city_name']
                item['county_name'] = scraped['county_name']
                item['town_name'] = scraped['town_name']
                item['gbcode'], href = self.get_text_href(tr.xpath('td')[0])
                yield item

    def parse_citytr(self, response):
        for city in self.parse_2td(response, response.selector.css(".citytr"), 'city_name', self.parse_countytr):
            yield city

    def parse_countytr(self, response):
        for county in self.parse_2td(response, response.selector.css(".countytr"), 'county_name', self.parse_towntr):
            yield county

    def parse_towntr(self, response):
        for town in self.parse_2td(response, response.selector.css(".towntr"), 'town_name', None):
            yield town

Yes, just like this. –

Answer


I think you are just overcomplicating things a bit. This is a simple scrape; all you need to do is pass the information from one page to the next using meta. Since meta is an in-memory dictionary, we need to make sure each item gets its own copy of the information. For that we use copy.deepcopy, which ensures the data is not overwritten before the item is yielded.
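The effect of deepcopy can be seen in isolation (plain Python, no Scrapy; the dict contents are illustrative):

```python
from copy import deepcopy

meta = {'prov_name': 'Beijing'}
pending = []  # stands in for Scrapy's request queue

for city in ['City A', 'City B', 'City C']:
    scraped = deepcopy(meta)     # independent copy per request
    scraped['city_name'] = city
    pending.append(scraped)

# Each queued dict keeps the value it was given.
assert [d['city_name'] for d in pending] == ['City A', 'City B', 'City C']
```

Since each request now carries its own dictionary, the asynchronous callbacks can no longer clobber each other's data.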

Here is a scraper that does exactly that:

from copy import deepcopy

from scrapy.spiders import Spider


class StatsSpider(Spider):
    name = 'stats'
    allowed_domains = ['stats.gov.cn']
    start_urls = [
        'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/{}/index.html'.format(year)
        for year in range(2009, 2010)]

    def parse(self, response):
        for item in response.css(".provincetr a"):
            name = item.xpath("./text()").extract_first().strip()
            link = item.xpath("./@href").extract_first().strip()
            yield response.follow(link, callback=self.parse_province,
                                  meta={'item': {'province': name}})

    def parse_province(self, response):
        meta = response.meta['item']

        for cityrow in response.css(".citytr"):
            city_link = cityrow.xpath("./td[2]/a/@href").extract_first()
            city_name = cityrow.xpath("./td[2]/a/text()").extract_first()
            city_code = cityrow.xpath("./td[1]/a/text()").extract_first()

            # Copy the accumulated data so each request carries its own dict.
            meta_new = deepcopy(meta)
            meta_new['city_name'] = city_name
            meta_new['city_code'] = city_code

            yield response.follow(city_link, callback=self.parse_city,
                                  meta={'item': meta_new})

    def parse_city(self, response):
        meta = response.meta['item']

        for countyrow in response.css(".countytr"):
            county_link = countyrow.xpath("./td[2]/a/@href").extract_first()
            county_name = countyrow.xpath("./td[2]/a/text()").extract_first()
            county_code = countyrow.xpath("./td[1]/a/text()").extract_first()

            meta_new = deepcopy(meta)
            meta_new['county_name'] = county_name
            meta_new['county_code'] = county_code

            yield response.follow(county_link, callback=self.parse_county,
                                  meta={'item': meta_new})

    def parse_county(self, response):
        meta = response.meta['item']

        for townrow in response.css(".towntr"):
            town_link = townrow.xpath("./td[2]/a/@href").extract_first()
            town_name = townrow.xpath("./td[2]/a/text()").extract_first()
            town_code = townrow.xpath("./td[1]/a/text()").extract_first()

            meta_new = deepcopy(meta)
            meta_new['town_name'] = town_name
            meta_new['town_code'] = town_code

            yield meta_new