Extracting data across multiple links with Scrapy


I am trying to scrape this website. As you can see, the page at that URL shows 10 names, each of which is a link (Alex, Michele, etc.).

import scrapy


class Italy1Spider(scrapy.Spider):
    name = "italyspider"

    def start_requests(self):
        urls = [
            'http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'italy2-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

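I wrote the code above. With it, I can run response.css('a::text').extract() to get the 10 names mentioned above.

However, I need the email address found behind each name's link. I also need this for all pages, not just the one shown above.

What do I need to add to my code to achieve this? I have tried all sorts of things but can't seem to get it to work.

Any help is appreciated!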

Asked 2017-04-06 by Nick Hawthorne

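You need to crawl the pages one by one. First request the listing page, then find all the URLs that point to people, and follow each one to retrieve that person's details (name, email, etc.).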

import scrapy


class Italy1Spider(scrapy.Spider):
    name = "italyspider"
    start_urls = ['http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=']

    def parse(self, response):
        # find all URLs that point to people
        people_urls = response.css('.view-albo td a::attr(href)').extract()
        people_urls = list(set(people_urls))  # make unique
        for url in people_urls:
            # go to every person's page; urljoin also resolves relative hrefs
            yield scrapy.Request(response.urljoin(url), self.parse_person)

    def parse_person(self, response):
        # parse some stuff here
        # to find the email, locate the th node with the text "Email" and then
        # navigate to the sibling td node that contains the addresses:
        emails = response.xpath("//th[contains(text(),'Email')]/following-sibling::td/text()").extract()
        # e.g. [u'[email protected]', u'[email protected]']
        yield {'url': response.url, 'emails': emails}
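
The question also asks for every listing page, not only page=1. One possible way to cover that, sketched here under assumptions, is to generate one listing URL per page number and use the result as start_urls. The helper name listing_urls and the page range are hypothetical: the thread does not say how many pages the site has or which number the first page uses, so both need to be checked by hand.

```python
from urllib.parse import urlencode

# Listing endpoint taken from the URL in the question.
BASE_URL = "http://www.odceccastrovillari.it/portale/albo_vista"


def listing_urls(first_page, last_page):
    """Build one listing URL per page number in [first_page, last_page].

    Hypothetical helper: the valid page range is an assumption and
    must be confirmed against the real site.
    """
    urls = []
    for page in range(first_page, last_page + 1):
        # Same query parameters as the URL in the question,
        # with only the page number changing.
        query = urlencode({
            "page": page,
            "field_cognome_value": "",
            "field_nome_value": "",
        })
        urls.append(BASE_URL + "?" + query)
    return urls


if __name__ == "__main__":
    for url in listing_urls(1, 3):
        print(url)
```

In the spider, start_urls = listing_urls(1, 3) would then feed each listing page through parse. If the site exposes a "next page" link, following that link from parse is an alternative that avoids guessing the last page number.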

Answered 2017-04-06 13:57:11 by Granitosaurus

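How do I scrape the emails from this page, though? –

I don't understand how the code above helps to get the emails? –

@NickHawthorne I included the XPath for the emails in my answer. – Granitosaurus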
