Get all the text from a page's source in Python using WebDriver

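I am using Selenium WebDriver (Firefox) to scrape some data from a website. I just found out that opening a web page is slower than opening that page's source. In other words, going to 'www.google.com' takes longer than going to 'view-source:www.google.com'.

So I would like to know whether I can use WebDriver to get all the text from the source page instead of from the normal page.

I tried using driver.page_source on the source page, but it returned a mess that I don't want.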

2016-08-12 Marco

If you only need the source, use requests. Install it with pip:

pip install requests

And use it like this:

import requests

# Download the page without starting a browser
r = requests.get("http://google.com/")
# r.content (bytes), r.text (str), r.json() and r.status_code can be used

For advanced usage, refer to the requests documentation.

Note: if you need to parse the HTML, use BeautifulSoup and pass it r.content.
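To illustrate the parsing step, here is a minimal sketch that turns raw HTML into plain text. It deliberately uses the standard library's html.parser rather than BeautifulSoup, so it runs without extra installs; the names TextExtractor and html_to_text are made up for this example and are not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []     # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        # One text fragment per line
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(sample))
```

The same function should work on any HTML string, so it could be fed r.text from requests or driver.page_source from Selenium. It is only a rough stand-in: a real parser such as BeautifulSoup copes better with messy markup.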

2016-08-12 21:29:21

Yes, but I have to use the web driver because I need to pass the reCAPTCHA check manually. – Marco

[This](http://stackoverflow.com/questions/7861775/python-selenium-accessing-html-source) should give you options for getting the source. Also, to optimize loading speed, you can disable images as shown [here](http://stackoverflow.com/questions/25214473/disable-images-in-selenium-python). –

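@user3182260 To pass the CAPTCHA check, you will probably need to render the page, not just download the source. You could try PhantomJS instead of Selenium + a browser. Or it might render faster in another browser. – jpaugh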
