Scraping dynamically generated data with Scrapy (and Selenium?)

I am struggling to get Scrapy (with or without Selenium) to extract dynamically generated content from a web page. The site lists the performance of different universities and lets you select each study area offered by that institution. For example, from the page listed in the code below, I would like to be able to extract the university name ("Bond University") and the value for "overall quality of experience" (91.3%).
However, when I use 'view source', curl, or Scrapy, the actual values do not appear. For example, where I expect to see the university name, it shows:
<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>
But if I inspect the element with Firebug or Chrome, it shows:
<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>
On further inspection of the 'Net' panel in Firebug, I can see there is an AJAX(?) call that returns the relevant information, but I have not been able to emulate it in Scrapy or even curl (yes, I searched, and I'm afraid I spent an embarrassingly long time on this). The request payload is:
{"InstitutionId":20,"StudyAreaId":0}
sent as POST parameters, along with the following request headers:
POST /Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData HTTP/1.1
Host: www.qilt.edu.au
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
X-Requested-With: XMLHttpRequest
Referer: http://www.qilt.edu.au/institutions/institution/bond-university/business-management
Content-Length: 36
Cookie: _ga=GA1.3.69062787.1442441726; ASP.NET_SessionId=lueff4ysg3yvd2csv5ixsc1f; _gat=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
I tried using Selenium together with Scrapy, since I thought it might "see" the real values the way a browser does, but to no avail. My main attempt so far is below:
import scrapy
import time  # used for the sleep() function
from selenium import webdriver


class QiltSpider(scrapy.Spider):
    name = "qilt"
    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
        time.sleep(5)  # tried pausing, in case the problem was delayed loading - didn't work

    def parse(self, response):
        # parse the response to find the uni name and show it in the console (using xpath code from firebug).
        # This finds the relevant section, but it shows as empty
        title = response.xpath('//*[@id="bd"]/div[2]/div/div/div[1]/div/div[2]/h1').extract()
        print title
        # dump the whole response to a file so I can check whether dynamic values were captured
        with open("extract.html", 'wb') as f:
            f.write(response.body)
        self.driver.close()
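What I was imagining is that the HTML rendered by the Selenium driver could be fed into a Scrapy selector in place of the raw response, i.e. replacing the parse method above with something roughly like the following (untested on my side, and the XPath is just a guess based on the markup shown earlier):

    from scrapy.selector import Selector

    def parse(self, response):
        # build a selector from the browser-rendered HTML rather than the raw download
        sel = Selector(text=self.driver.page_source)
        title = sel.xpath('//h1[contains(@class, "instiution-name")]/text()').extract()
        print title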
Can anyone tell me how to achieve this?
Thanks very much!
EDIT: Thanks for the suggestions so far, but any ideas on how, specifically, to mimic the AJAX call with the InstitutionId and StudyAreaId parameters? My code to test this is below, but it still seems to hit an error page.
import scrapy
from scrapy.http import FormRequest


class HeaderTestSpider(scrapy.Spider):
    name = "headerTest"
    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def parse(self, response):
        return [FormRequest(url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionData",
                            method='POST',
                            formdata={'InstitutionId': '20', 'StudyAreaId': '0'},
                            callback=self.parser2)]
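From the captured headers, it looks like the service expects a JSON body with Content-Type: application/json rather than form-encoded data, so maybe something along these lines is needed instead (untested; the endpoint is the FetchInstitutionStudyAreaData one from the Firebug trace, and parse_data is just a placeholder callback name):

    import json
    import scrapy


    class HeaderTestSpider(scrapy.Spider):
        name = "headerTest"
        allowed_domains = ["qilt.edu.au"]
        start_urls = [
            "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
        ]

        def parse(self, response):
            # send the payload as a JSON body, mirroring the AJAX call seen in Firebug
            yield scrapy.Request(
                url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData",
                method='POST',
                body=json.dumps({"InstitutionId": 20, "StudyAreaId": 0}),
                headers={
                    'Content-Type': 'application/json; charset=utf-8',
                    'X-Requested-With': 'XMLHttpRequest',
                },
                callback=self.parse_data)

        def parse_data(self, response):
            # ASP.NET .asmx services often wrap the JSON result in a "d" key
            data = json.loads(response.body)
            print data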
You could use the [requests](http://www.python-requests.org/en/latest/) library and mimic the *AJAX* call being made. –
Since Scrapy is already in use, there is no need for requests. – alecxe
Why not use Selenium and scrape the data from the page after it has been rendered in the browser? – JeffC
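For what it's worth, a rough sketch of that last suggestion would be to wait in Selenium until the bound text has actually been populated before reading it (untested; the selector and wait conditions are guesses based on the markup above):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    driver.get("http://www.qilt.edu.au/institutions/institution/bond-university/business-management")
    try:
        # wait for the <h1> to exist, then for its data-bound text to be filled in
        heading = WebDriverWait(driver, 10).until(
            lambda d: d.find_element(By.CSS_SELECTOR, "h1.instiution-name"))
        WebDriverWait(driver, 10).until(lambda d: heading.text.strip() != "")
        print heading.text
    finally:
        driver.quit()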