Crawling pages at a different URL after logging in


My knowledge of Scrapy is limited. With the code below I can log in to a particular forum. Now, after logging in, I need to go to a different URL:

https://forum.xxx.com/threads/topic-name/page-300

I want to automate crawling over the page range 300-360 and, specifically, extract every messageText element.

How can I do that?

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'xxx.com'
    start_urls = ['https://forum.xxx.com/login/login']

    def parse(self, response):
        # Fill in and submit the login form found on the login page.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'login': 'xxx', 'register': '0', 'password': 'xxxxx', 'cookie_check': '0'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check that login succeeded before going on.
        # response.text is a str; response.body is bytes on Python 3.
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...


2017-04-12 anvd

Once you are logged in, you only need to yield as many requests as you need:

from scrapy import Request

# Both functions below are methods of the LoginSpider class.
def after_login(self, response):
    # Check that login succeeded before going on.
    if "authentication failed" in response.text:
        self.logger.error("Login failed")
        return
    # range() excludes its end value, so use 361 to include page 360.
    for i in range(300, 361):
        url = 'https://forum.xxx.com/threads/topic-name/page-{}'.format(i)
        yield Request(url, self.parse_page)


def parse_page(self, response):
    # parse page here
    pass


2017-04-12 13:45:58 Granitosaurus

@anvd Yes, Request is a Scrapy class. See my edit for the correct import. – Granitosaurus

OK, now it works, but I get a DEBUG: Crawled (403) error, even after setting a user agent: headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'} yield Request(url=url, callback=self.parse_page, headers=headers) – anvd

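Many things can cause a 403 response; it means the website refused your request. The URL you are trying to crawl may be malformed, or there may be something wrong with your request. Have you tried opening it in a browser? This is a whole new issue that needs a fair amount of debugging, so you should open a new question and mention which website you are crawling, since every website handles this differently. – Granitosaurus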
