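How to get the rendered HTML source code using Selenium

I run a query on a web page and get a result URL. If I right-click and view the HTML source, I can see the HTML generated by JS. If I simply use urllib, Python cannot get the JS-generated code. I have seen some solutions that use Selenium. Here is my code: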

from selenium import webdriver

url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
# Raw string so the backslashes in the Windows path are taken literally
driver = webdriver.PhantomJS(executable_path=r'C:\python27\scripts\phantomjs.exe')
driver.get(url)
print(driver.page_source)

>>> <html><head></head><body></body></html>

Obviously that's not right!

Here is the source I see when I right-click in the window (part of the information I want):

</script></div><div class="searchColRight"><div id="topActions" class="clearfix 
noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary" 
href="Default.aspx? _act=VitalSearchR ...... <<INFORMATION I NEED>> ... 
to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-microsoft-com:xslt"> 

     jQuery(document).ready(function() { 
      jQuery(".ancestry-information-tooltip").actooltip({ 
href: "#AncestryInformationTooltip", orientation: "bottomleft"}); 
     });

=========== So my question =============== How do I get the information generated by JS?


2014-03-30 MacSanhe

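What does the HTML you want look like on the web page? You will want to use one of Selenium's 'find_element_by_*' functions, but which one depends on the HTML itself. – Victory

I mean everything. For example, you type something into Google. On the results page, right-click and view source. That is the "everything" I want. – MacSanhe

You will need to get the document through JavaScript; you can use Selenium's execute_script function: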

from time import sleep  # this should go at the top of the file

sleep(5)  # crude wait to give the page's JavaScript time to run
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print(html)

This returns everything inside the <html> tag.


2014-03-30 02:35:51 Victory

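Then I only get: ..... How come.... >< – MacSanhe

It seems to run, but it only gave me the same empty result. I have restated my question above; could you please take another look at it? Thank you very much – MacSanhe

@MacSanhe See my edit: if the page has not fully loaded, you will not get all of the body content. You can also try visiting the page and running 'document.getElementsByTagName('html')[0].innerHTML' in the debugger console to see how much of the DOM is actually there. – Victory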

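I think the source you are getting is the page as it was before the JavaScript rendered the dynamic HTML.

As a first step, try putting a few seconds of sleep between navigating to the page and getting the page source.

If that works, you can then switch to a different wait strategy.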
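
One possible "different wait strategy" is to poll the page source until some marker of the rendered content shows up, instead of sleeping for a fixed time. The following is only a minimal Python 3 sketch under a few assumptions: the helper names wait_for_page_source and has_search_results are made up for illustration, the check for class="searchColRight" is borrowed from the snippet in the question and may not be the right marker for your page, and the driver is only assumed to expose a page_source attribute (as a Selenium WebDriver does). Selenium also ships its own WebDriverWait helper for this purpose; the sketch avoids it only so that it can run without a browser.

```python
import time


def wait_for_page_source(driver, is_ready, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until is_ready(html) is true.

    driver:   any object with a page_source attribute
              (a Selenium WebDriver has one).
    is_ready: caller-supplied predicate that takes the HTML string.
    Raises RuntimeError if the predicate is still false after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "page did not finish rendering within %s seconds" % timeout
            )
        time.sleep(poll_interval)


def has_search_results(html):
    # Hypothetical readiness check: the class name is taken from the
    # snippet in the question; replace it with whatever marks
    # "rendered" on the page you are scraping.
    return 'class="searchColRight"' in html
```

After driver.get(url), this would be called as html = wait_for_page_source(driver, has_search_results). Compared with a fixed sleep(5), it returns as soon as the content appears and fails loudly if it never does.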


2014-03-30 14:55:31

You do not need that workaround; you can use this instead:

driver = webdriver.PhantomJS() 
driver.get('http://www.google.com/') 
html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')


2017-04-22 22:06:01

I ran into the same problem and finally solved it with desired_capabilities.

from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType

# Route PhantomJS through a manually configured HTTP proxy
proxy = Proxy(
    {
        'proxyType': ProxyType.MANUAL,
        'httpProxy': 'ip:host'  # placeholder: replace with your proxy address
    }
)
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS.copy()
proxy.add_to_capabilities(desired_capabilities)
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get('test_url')  # placeholder URL
print(driver.page_source)


2017-05-24 07:15:28 Vida

Try Dryscrape; this browser fully supports JS-heavy pages. Give it a try, I hope it works for you.


2017-12-11 20:36:44 Harry1992

This is a comment, not an answer. –
