2017-05-01

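How to use Scrapy to scrape second-level data on a page

I want to use a Scrapy spider to collect data (question title + content and answer) from every post on the following site: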

https://forums.att.com/t5/custom/page/page-id/latest-activity/category-id/Customer_Care/page/1?page-type=latest-solutions-topics

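The problem is that I don't know how to make the spider first follow the links to the posts and then scrape the data from all 15 posts on each page.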

import scrapy


class ArticleSpider(scrapy.Spider):
    name = "post"
    start_urls = ['https://forums.att.com/t5/Data-Messaging-Features-Internet/Throttling-for-unlimited-data/m-p/4805201#M73235']

    def parse(self, response):
        SET_SELECTOR = 'body'
        for post in response.css(SET_SELECTOR):

            # Selectors for title, content and answer
            TITLE_SELECTOR = '.lia-message-subject h5 ::text'
            CONTENT_SELECTOR = '.lia-message-body-content'
            ANSWER_SELECTOR = '.lia-message-body-content'

            yield {
                # [0].extract() is like extract_first(), but raises
                # IndexError when nothing matches
                'Qtitle': post.css(TITLE_SELECTOR)[0].extract(),
                'Qcontent': post.css(CONTENT_SELECTOR)[0].extract(),
                'Answer': post.css(ANSWER_SELECTOR)[1].extract(),
            }

        # Running through all 173 pages
        NEXT_PAGE_SELECTOR = '.lia-paging-page-next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

I hope you can help me. Thanks in advance!

Answer


You need to add a method that scrapes the content of each post. You could rewrite the spider like this (I used XPath selectors):

# -*- coding: utf-8 -*-
import scrapy


class ArticleSpider(scrapy.Spider):
    name = "post"
    start_urls = ['https://forums.att.com/t5/custom/page/page-id/latest-activity/category-id/Customer_Care/page/1?page-type=latest-solutions-topics']

    def parse(self, response):
        # Follow the link to every post listed on the main page.
        for post_link in response.xpath('//h2//a/@href').extract():
            link = response.urljoin(post_link)
            yield scrapy.Request(link, callback=self.parse_post)

        # If the main page links to a next page, keep parsing.
        next_page = response.xpath('(//a[@rel="next"])[1]/@href').extract_first()
        if next_page:
            # urljoin leaves absolute URLs unchanged and resolves relative ones.
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_post(self, response):
        # Scrape the title and content of every message in the post.
        for post in response.xpath('//div[contains(@class, "lia-quilt-forum-message")]'):
            item = dict()
            item['title'] = post.xpath('.//h5/text()').extract_first()
            item['content'] = post.xpath('.//div[@class="lia-message-body-content"]//text()').extract()
            yield item

        # If the post page links to a next page, keep parsing.
        next_page = response.xpath('(//a[@rel="next"])[1]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse_post)

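This code extracts all the post links from the main page and calls the parse_post method to scrape the content of each post. Both the parse and parse_post methods check whether a next-page link exists and, if it does, keep scraping.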
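If the two-callback flow is hard to picture, here is a rough toy model of it that runs without Scrapy or a network connection. Everything in it is made up for illustration: the `SITE` dictionary, the example.com URLs, and the `crawl` function, which is only a simplified stand-in for Scrapy's engine and not how Scrapy is actually implemented. The only part that mirrors the spider above is the shape of `parse` and `parse_post`: each one yields either items or new requests paired with a callback.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory site standing in for the forum (made-up URLs and content).
SITE = {
    "https://example.com/list/1": {"posts": ["/post/a", "/post/b"], "next": "/list/2"},
    "https://example.com/list/2": {"posts": ["/post/c"], "next": None},
    "https://example.com/post/a": {"messages": ["Question A", "Answer A"], "next": None},
    "https://example.com/post/b": {"messages": ["Question B"], "next": "/post/b/2"},
    "https://example.com/post/b/2": {"messages": ["Answer B"], "next": None},
    "https://example.com/post/c": {"messages": ["Question C"], "next": None},
}


def parse(url, page):
    """Listing-page callback: request every post, then the next listing page."""
    for link in page["posts"]:
        yield urljoin(url, link), parse_post
    if page["next"]:
        yield urljoin(url, page["next"]), parse


def parse_post(url, page):
    """Post-page callback: emit one item per message, then follow pagination."""
    for text in page["messages"]:
        yield {"url": url, "content": text}
    if page["next"]:
        yield urljoin(url, page["next"]), parse_post


def crawl(start_url):
    """Simplified stand-in for the engine: run callbacks and queue what they request."""
    queue = deque([(start_url, parse)])
    seen = {start_url}  # skip URLs that were already requested
    items = []
    while queue:
        url, callback = queue.popleft()
        for result in callback(url, SITE[url]):
            if isinstance(result, dict):
                items.append(result)  # a scraped item
            else:
                next_url, next_callback = result  # a new request
                if next_url not in seen:
                    seen.add(next_url)
                    queue.append((next_url, next_callback))
    return items


if __name__ == "__main__":
    for item in crawl("https://example.com/list/1"):
        print(item)
```

The point of the sketch is that the listing callback never scrapes post content itself. It only hands each post URL to the second callback, and both callbacks re-queue themselves for their own "next" link until none is left.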