2016-02-14

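Submitting a form to scrape data from a job board

I am trying to write a script that can grab job details from a specific website. When I compare the page source in Google Chrome (command-option-U) with what the Developer Tools show (command-option-I), the HTML is different. The Developer Tools view contains the actual details that I could parse out of the HTML.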

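An example of what I find on the website for the first job posting:

Fort McMurray, Alberta, Canada, Edmonton, Alberta, Canada

I know I need to submit the form with a POST, but beyond that I cannot get the HTML that I see in Developer Tools; it is not present in the response to my request.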

import requests

# Search page URL
url = 'https://caterpillar.taleo.net/careersection/cat+external+cs/jobsearch.ftl?lang=en&portal=4140124208&src=CWS-10005'

# POST the form field directly; this returns only the static HTML
r = requests.post(url, data={'dropListSize': 100})
print(r.status_code, r.reason)
html = r.text

I also tried a similar strategy using mechanize:

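import mechanize

br = mechanize.Browser()
br.open(url)  # same url as in the requests example

# List the forms on the page to find the one to submit
for f in br.forms():
    print(f)

# Select the search form and set dropListSize to 100
br.select_form(name='ftlform')
br.form["dropListSize"] = ["100"]
br.submit()
html = br.response().read()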

A related question is how I would get to the next page, but I think I may be able to figure that out.

Answer

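There is an XHR POST request sent to the https://caterpillar.taleo.net/careersection/cat+external+cs/jobsearch.ajax endpoint, and its response contains all of the search results. You can try to simulate it (judging by the number of parameters and the response format, I would guess that is not going to be fun), or you can load the page in a real browser via selenium, let the browser do the work, and not worry about how the search results are delivered.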
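A quick way to confirm that the listings are missing from the static response is to count the result rows in the HTML you get back. The following is only a minimal sketch using the standard library: it assumes the result rows carry the `ftlrow` class (the selector used in the selenium example), `RowCounter` and `count_rows` are hypothetical names, and the two HTML strings are made-up stand-ins for the two views of the page.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        # The class attribute may be missing or empty
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_rows(html, css_class="ftlrow"):
    """Return how many <tr class="..."> rows with css_class appear in html."""
    parser = RowCounter(css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snippets standing in for "view source" and the rendered DOM
static_html = "<table class='contentlist'></table>"
rendered_html = (
    "<table class='contentlist'>"
    "<tr class='ftlrow'><td>Job A</td></tr>"
    "<tr class='ftlrow odd'><td>Job B</td></tr>"
    "</table>"
)

print(count_rows(static_html))
print(count_rows(rendered_html))
```

Against the real page you would pass `r.text` from the requests call. If the count is zero there while the browser shows rows, the rows are most likely being added by JavaScript after the page loads.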

Working example that uses selenium + PhantomJS to simulate a browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


url = 'https://caterpillar.taleo.net/careersection/cat+external+cs/jobsearch.ftl?lang=en&portal=4140124208&src=CWS-10005'

# PhantomJS support was removed in Selenium 4; with a newer Selenium,
# a headless Chrome or Firefox driver can be used instead.
driver = webdriver.PhantomJS()
driver.get(url)

# Wait up to 10 seconds for the results table to appear
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.contentlist")))

# Print the title of every job row
for row in table.find_elements(By.CSS_SELECTOR, "tr.ftlrow"):
    title = row.find_element(By.CSS_SELECTOR, ".titlelink a").text
    print(title)

driver.quit()

Prints:

Sales accountant 
Manufacturing Project Engineer 
Staff Accountant - Accountable 
Hydraulic Cylinder Design Engineer 
Engineering Supervisor(Hydraulic Cylinder) 
Design Engineer 
Senior Design Engineer 
Senior Engineer 
Senior Design Engineer 
Dealer Solution Network (DSN) Analyst