Scraping dynamically generated data with scrapy (and selenium?)

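I'm struggling to get scrapy (with or without selenium) to extract dynamically generated content from a web page. The site lists the performance of different universities and lets you select each study area that the uni offers. For example, from the page listed in the code below, I'd like to be able to extract the uni name ("Bond University") and the value for "Overall quality of experience" (91.3%).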

However, when I use "view source", curl, or scrapy, the actual values are not shown. For example, where I expect to see the uni name, it shows:

<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>

But if I use Firebug or Chrome's "inspect element", it shows:

<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>

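On further inspection, in the "Net" tab in Firebug, I can see that there is an AJAX (?) call returning the relevant information, but I have not been able to mimic it in scrapy, or even in curl (yes, I searched, and spent an embarrassingly long time on it, I'm afraid). The request is sent with the following POST parameters: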

{"InstitutionId":20,"StudyAreaId":0}

Request headers:

POST /Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData HTTP/1.1 
Host: www.qilt.edu.au 
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0 
Accept: application/json, text/javascript, */*; q=0.01 
Accept-Language: en-US,en;q=0.5 
Accept-Encoding: gzip, deflate 
Content-Type: application/json; charset=utf-8 
X-Requested-With: XMLHttpRequest 
Referer: http://www.qilt.edu.au/institutions/institution/bond-university/business-management 
Content-Length: 36 
Cookie: _ga=GA1.3.69062787.1442441726; ASP.NET_SessionId=lueff4ysg3yvd2csv5ixsc1f; _gat=1 
Connection: keep-alive 
Pragma: no-cache 
Cache-Control: no-cache

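As a second option, I tried using selenium with scrapy, since I thought it might "see" the real values the way a browser does, but to no avail. My main attempt so far is below: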

import scrapy
import time  # used for the sleep() function

from selenium import webdriver


class QiltSpider(scrapy.Spider):
    name = "qilt"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
        time.sleep(5)  # tried pausing, in case problem was delayed loading - didn't work

    def parse(self, response):
        # parse the response to find the uni name and show in console (using xpath code from firebug).
        # This finds the relevant section, but it shows as empty
        title = response.xpath('//*[@id="bd"]/div[2]/div/div/div[1]/div/div[2]/h1').extract()
        print(title)
        # dumping the whole response to a file so I can check whether dynamic values were captured
        with open("extract.html", 'wb') as f:
            f.write(response.body)
            self.driver.close()

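Can anyone tell me how to do this?

Many thanks!

EDIT: Thanks for the suggestions so far, but any ideas on how to specifically mimic the AJAX call with the InstitutionId and StudyAreaId parameters? My code to test this is below, but it still seems to hit an error page.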

import scrapy
from scrapy.http import FormRequest


class HeaderTestSpider(scrapy.Spider):
    name = "headerTest"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def parse(self, response):
        return [FormRequest(url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionData",
                            method='POST',
                            formdata={'InstitutionId': '20', 'StudyAreaId': '0'},
                            callback=self.parser2)]

Source

2015-09-17 Tango delta

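You could use the [requests](http://www.python-requests.org/en/latest/) library and mimic the *AJAX* call being made.

Since Scrapy is already being used, there is no need for `requests`. – alecxe

Why not use Selenium and scrape the data off the page after it has rendered in the browser? – JeffC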

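The QILT page uses AJAX to retrieve its data from the server. The AJAX request is sent by JavaScript code that is triggered by the document.ready (jQuery) / window.onload (JavaScript) event. If you are not familiar with JavaScript: this event fires once the page has finished loading in the browser window. Since you are using software rather than a browser to make the page request, the event never fires, so the data is never fetched.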
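As a small illustration of that point, the sketch below parses the raw markup quoted in the question with Python's standard `html.parser` and shows that the `<h1>` carries no text until a browser has run the script. The class and function names (`H1TextCollector`, `h1_text`) are made up for this example.

```python
from html.parser import HTMLParser

# Raw markup as quoted in the question (what curl or Scrapy receive).
RAW_HTML = '<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>'


class H1TextCollector(HTMLParser):
    """Collects the text found inside <h1> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.text.append(data)


def h1_text(html):
    """Return the concatenated text of all <h1> elements in the markup."""
    parser = H1TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.text).strip()


if __name__ == "__main__":
    # Nothing has filled the element in, so there is no text to extract.
    print(repr(h1_text(RAW_HTML)))
```

No XPath or selector change can recover the name from this response, because the text is simply not in it; the data has to come from the AJAX endpoint or from a real browser.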

For the AJAX request you are trying to mimic, the request body is of type application/json. Add the following header to the request: Content-Type: application/json

Source

2015-09-19 14:21:26
