2015-09-17 41 views
2

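Scraping dynamically generated data with Scrapy (and Selenium?)

I'm struggling to get Scrapy (with or without Selenium) to extract dynamically generated content from a web page. The site lists the performance of different universities and lets you select each study area offered by the institution. For example, from the page listed in the code below, I'd like to be able to extract the university name ("Bond University") and the value for "Overall quality of experience" (91.3%).

However, when I use "view source", curl, or Scrapy, the actual values don't show up. For example, where I'd expect to see the university name, it shows: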

<h1 class="inline-block instiution-name" data-bind="text: Description"></h1> 

But if I use Firebug or Chrome's Inspect Element, it shows:

<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1> 
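The difference between the two snippets can be reproduced offline. The sketch below uses only Python's standard-library `HTMLParser` to read the `<h1>` text from each snippet; `H1TextParser` and `h1_text` are hypothetical helper names, not part of Scrapy or the site. The `data-bind` attribute looks like a client-side binding that a script fills in after load (an assumption; the page's library is not named here), which would explain why any tool that only reads the served markup sees an empty element.

```python
from html.parser import HTMLParser


class H1TextParser(HTMLParser):
    """Collect the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.texts.append("")  # start a new, initially empty, entry

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.texts[-1] += data


def h1_text(html):
    """Return a list with the text content of every <h1> in the markup."""
    parser = H1TextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


# Markup as served (what "view source", curl, or Scrapy receive).
served = '<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>'
# Markup after the browser has run the page's scripts (what Firebug shows).
rendered = '<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>'

print(h1_text(served))    # [''] - the element exists but is empty
print(h1_text(rendered))  # ['Bond University']
```

An XPath that matches the `<h1>` will therefore succeed on the served markup and still return no text.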

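On further inspection, in the "Net" tab in Firebug, I can see an AJAX (?) call that returns the relevant information, but I haven't been able to mimic it in Scrapy, or even in curl (yes, I searched, and spent an embarrassingly long time on it, I'm afraid). The request passes the following as POST parameters:

{"InstitutionId":20,"StudyAreaId":0}

Request headers: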

POST /Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData HTTP/1.1 
Host: www.qilt.edu.au 
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0 
Accept: application/json, text/javascript, */*; q=0.01 
Accept-Language: en-US,en;q=0.5 
Accept-Encoding: gzip, deflate 
Content-Type: application/json; charset=utf-8 
X-Requested-With: XMLHttpRequest 
Referer: http://www.qilt.edu.au/institutions/institution/bond-university/business-management 
Content-Length: 36 
Cookie: _ga=GA1.3.69062787.1442441726; ASP.NET_SessionId=lueff4ysg3yvd2csv5ixsc1f; _gat=1 
Connection: keep-alive 
Pragma: no-cache 
Cache-Control: no-cache 

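As a second option, I tried using Selenium with Scrapy, since I thought it might "see" the real values the way a browser does, but to no avail. My main attempt so far is as follows: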

import scrapy
import time  # used for the sleep() function

from selenium import webdriver


class QiltSpider(scrapy.Spider):
    name = "qilt"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def __init__(self, *args, **kwargs):
        super(QiltSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
        self.driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
        time.sleep(5)  # tried pausing, in case the problem was delayed loading - didn't work

    def parse(self, response):
        # Parse the response to find the uni name and show it in the console (using XPath from Firebug).
        # This finds the relevant section, but it shows as empty.
        title = response.xpath('//*[@id="bd"]/div[2]/div/div/div[1]/div/div[2]/h1').extract()
        print(title)
        # Dump the whole response to a file so I can check whether dynamic values were captured.
        with open("extract.html", 'wb') as f:
            f.write(response.body)
        self.driver.close()

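Can anyone tell me how to do this?

Many thanks!

Edit: Thanks for the suggestions so far, but any ideas on how to specifically mimic the AJAX call with the parameters InstitutionId and StudyAreaId? My code to test this is below, but it still seems to hit an error page.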

import scrapy
from scrapy.http import FormRequest


class HeaderTestSpider(scrapy.Spider):
    name = "headerTest"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def parse(self, response):
        return [FormRequest(url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionData",
                            method='POST',
                            formdata={'InstitutionId': '20', 'StudyAreaId': '0'},
                            callback=self.parser2)]
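
For comparison with the captured request, the sketch below prints what a form-encoded body looks like next to the JSON body implied by the captured `Content-Type: application/json` header. It uses only the standard library; the variable names are illustrative. Two differences from the capture may be relevant, though neither is confirmed as the cause of the error page: `formdata` produces a form-encoded body rather than JSON, and the URL in the attempt ends in `FetchInstitutionData`, whereas the captured request went to `FetchInstitutionStudyAreaData`.

```python
import json
from urllib.parse import urlencode

# Values taken from the question; both bodies carry the same two parameters.
form_fields = {"InstitutionId": "20", "StudyAreaId": "0"}
json_fields = {"InstitutionId": 20, "StudyAreaId": 0}

# Roughly what a form-encoded POST body looks like
# (Content-Type: application/x-www-form-urlencoded).
form_body = urlencode(form_fields)

# The body the browser sent, per the captured request
# (Content-Type: application/json; charset=utf-8).
json_body = json.dumps(json_fields, separators=(",", ":"))

print(form_body)
print(json_body)
```

A service that expects JSON will generally not parse the first form, even though the parameter names and values match.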
+0

You can use the [requests](http://www.python-requests.org/en/latest/) library and mimic the *AJAX* call being made. –

+0

Since Scrapy is already in use, there is no need for "requests". – alecxe

+0

Why not use Selenium and scrape the data off the page after it has rendered in the browser? – JeffC

Answers

1

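The QILT page uses AJAX to retrieve its data from the server. The AJAX request is sent by JavaScript code that is triggered by the document.ready (jQuery) / window.onload (JavaScript) event (if you're not familiar with JavaScript, this handler fires as soon as the page has finished loading in the browser window). Since you are fetching the page with software rather than a browser (Scrapy downloads the HTML but does not execute JavaScript), this event is never fired.

For the AJAX request you are trying to simulate, the request body is of type application/json. Please add the following header to the request:

Content-Type: application/json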
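
As a minimal sketch of that advice, the following builds (but does not send) a JSON POST using only the standard library. The URL, body, and header values are copied from the captured request in the question; `build_ajax_request` is a hypothetical helper name. Whether the server also requires the session cookie or `Referer` from the capture is not known, so the network call is left commented out.

```python
import json
import urllib.request

# Endpoint copied from the captured request headers in the question.
AJAX_URL = ("http://www.qilt.edu.au/Websilk/DataServices/"
            "SurveyData.asmx/FetchInstitutionStudyAreaData")


def build_ajax_request(institution_id, study_area_id):
    """Build (but do not send) a JSON POST resembling the captured call."""
    payload = {"InstitutionId": institution_id, "StudyAreaId": study_area_id}
    # Compact separators reproduce the captured 36-byte body exactly.
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(AJAX_URL, data=body, headers=headers, method="POST")


request = build_ajax_request(20, 0)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending is left commented out (assumption: the endpoint still exists and
# accepts the call without the captured cookie):
# with urllib.request.urlopen(request) as response:
#     data = json.loads(response.read().decode("utf-8"))
```

In a Scrapy spider, the same URL, body, and headers could be passed to `scrapy.Request(url, method="POST", body=..., headers=..., callback=...)` in place of `FormRequest`, with the callback decoding the response using `json.loads`.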