2017-04-06 · 56 views · score -1

Extracting data from multiple links using Scrapy

I'm trying to scrape this website. As you can see, the page at the URL above shows 10 names (Alex, Michele, etc.), each of which is a link.

import scrapy

class Italy1Spider(scrapy.Spider):
    name = "italyspider"

    def start_requests(self):
        urls = [
            'http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'italy2-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

I wrote the code above. With it, I can run response.css('a::text').extract() and get back the 10 names mentioned above.

However, what I need is the email address contained on each name's linked page. I also need this for all of the pages, not just the one shown above.

What do I need to add to my code to achieve this? I've tried all sorts of things but can't seem to get it to work.

Any help is appreciated!

Answer (score 0)

You need to crawl the pages one by one: first connect to the listing page, then find all the profile URLs and crawl each of those profiles to retrieve the details (name, email, etc.).

import scrapy
from scrapy import Request

class Italy1Spider(scrapy.Spider):
    name = "italyspider"
    start_urls = ['http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=']

    def parse(self, response):
        # find all urls that point to people
        people_urls = response.css('.view-albo td a::attr(href)').extract()
        people_urls = list(set(people_urls))  # make unique
        for url in people_urls:
            # go to every person's page (urljoin resolves relative hrefs)
            yield Request(response.urljoin(url), self.parse_person)

    def parse_person(self, response):
        # parse some stuff here
        # to find the email you need to find the <th> node with the "Email"
        # text, then navigate to the sibling <td> node that contains the emails:
        emails = response.xpath("//th[contains(text(),'Email')]/following-sibling::td/text()").extract()
        # e.g. [u'[email protected]', u'[email protected]']
        yield {'emails': emails}
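The question also asked for every page, not just the first. One way is to build the next page's URL by incrementing the `page` query parameter visible in the start URL. The helper below is a sketch under the assumption (not verified against the site) that the listing paginates via `?page=N` and eventually returns a page with no profile links:

```python
# Build the next listing-page URL by incrementing the `page` query parameter.
# Assumption: the site paginates via ?page=N, as the start URL suggests.
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url):
    """Return `url` with its `page` query parameter incremented by one."""
    parts = urlparse(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    page = int(query.get('page', ['1'])[0]) + 1
    query['page'] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

Inside `parse`, after yielding the profile requests, the spider could then `yield Request(next_page_url(response.url), self.parse)` whenever the current page produced any profile links, and stop once a page comes back empty.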
How do I scrape the emails from this page, though? –

I don't understand how the code above helps with getting the emails? –

@NickHawthorne I included the xpath for the email in the answer. – Granitosaurus