Python: scraping a web page that uses JavaScript

I am trying to scrape a web page with Selenium. Here is an example page: https://www.linkedin.com/vsearch/p?firstName=mark

In the HTML I can see that the search results are inside:

<div id='results-col'> ... </div>

But when I try to access this tag with BeautifulSoup:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.PhantomJS(executable_path=PATH)
browser.get(url)
bs_obj = BeautifulSoup(browser.page_source, "html.parser")
results_col = bs_obj.find("div", {"id": "results-col"})

I get nothing back (results_col is None). What am I doing wrong?


2016-12-14 Bob Sacamano

Add a sleep after browser.get so the JS has time to load. – Tobey

Wait for the desired element to be present, and only then get the page source:

from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 

# ... 
browser.get(url) 

wait = WebDriverWait(browser, 10) 
wait.until(EC.presence_of_element_located((By.ID, "results-col"))) 

bs_obj = BeautifulSoup(browser.page_source, "html.parser")
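To illustrate why an explicit wait is usually preferable to a fixed sleep, here is a minimal pure-Python sketch of the polling idea. It is not Selenium's implementation; the names wait_until and WaitTimeout are hypothetical, and the 0.5 second default simply mirrors WebDriverWait's default poll frequency.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value as soon as it appears, so the caller does not
    pay for a full fixed-length sleep when the page is ready early.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)
```

With a real driver, the condition could be something like lambda: browser.find_elements(By.ID, "results-col"), which returns an empty (falsy) list until the element exists.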


2016-12-14 19:38:16 alecxe

I tried your code, but I get:
Traceback (most recent call last):
File X, line 142, in <module> print(get_link_to_profile(SEARCH_URL))
File X, line 121, in get_link_to_profile wait.until(EC.presence_of_element_located((By.ID, "results-col")))
File "C:\Users\sergeyy\AppData\Roaming\Python\Python35\site-packages\selenium\webdriver\support\wait.py", line 80, in until raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: Screenshot: available via screen –

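@BobSacamano It could mean different things, but there is no such element on the page that PhantomJS opened. Take a screenshot after the page loads (for example with the 'save_screenshot()' method) and look at what was actually opened. You may need to start 'PhantomJS' with some arguments to make it work: http://stackoverflow.com/questions/29463603/phantomjs-returning-empty-web-page-python-selenium. – alecxe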
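A rough sketch of the "look at what was actually opened" step. The helper name dump_debug_artifacts and the file prefix are made up for illustration; the only driver members it relies on are save_screenshot() and page_source.

```python
def dump_debug_artifacts(browser, prefix="phantomjs_debug"):
    """Save a screenshot and the HTML the browser currently holds.

    Comparing the two files with the page in a desktop browser shows
    whether PhantomJS received the real page, a login wall, or nothing.
    """
    screenshot_path = prefix + ".png"
    html_path = prefix + ".html"
    browser.save_screenshot(screenshot_path)
    with open(html_path, "w", encoding="utf-8") as handle:
        handle.write(browser.page_source)
    return screenshot_path, html_path
```

Calling dump_debug_artifacts(browser) right after browser.get(url), before the wait, leaves two files to inspect even if the wait later times out.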

@BobSacamano Alternatively, you may need to change the user agent to pretend to be a different browser: https://coderwall.com/p/9jgaeq/set-phantomjs-user-agent-string. – alecxe
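One way the user-agent suggestion might look in code, as a sketch only. It assumes a Selenium 3.x install that still ships the PhantomJS driver; the function names and the example user-agent string are placeholders, while "phantomjs.page.settings.userAgent" is the capability key PhantomJS's driver reads.

```python
def build_user_agent_capability(user_agent):
    """Return the extra capability that overrides PhantomJS's user agent."""
    return {"phantomjs.page.settings.userAgent": user_agent}


def start_phantomjs(executable_path, user_agent):
    """Start PhantomJS with a custom user agent (hypothetical helper)."""
    # Imported lazily so the helper above can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Copy the defaults so the shared class-level dict is not mutated.
    capabilities = dict(DesiredCapabilities.PHANTOMJS)
    capabilities.update(build_user_agent_capability(user_agent))
    return webdriver.PhantomJS(
        executable_path=executable_path,
        desired_capabilities=capabilities,
    )
```

Usage would be along the lines of browser = start_phantomjs(PATH, "Mozilla/5.0 (X11; Linux x86_64) ..."), with the string copied from a browser in which the page renders correctly.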
