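Getting the inner HTML - Selenium, BeautifulSoup, Python

This is a complete edit of the question. Judging by the answers I must have asked it badly, so I will try to be clearer.

I have an object that I am trying to scrape. With my code running on my laptop, I have no problem getting this to work. When I moved to PythonAnywhere, I was no longer able to get the information I am looking for.

The code that works on my system is: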

from urllib.request import urlopen 
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import NoSuchElementException 
from selenium.common.exceptions import TimeoutException 
from selenium.webdriver.common.keys import Keys 
from bs4 import BeautifulSoup 
import csv 
import time 
import re 

#68 lines of code for another section of the site above this working well on my system and on pythonanywhere. 

pageSource = driver.page_source 
bsObj = BeautifulSoup(pageSource) 

try: 
    parcel_number = bsObj.find(id="mParcelnumbersitusaddress_mParcelNumber") 
    s_parcel_number =parcel_number.get_text()       
except AttributeError as e: 
    s_parcel_number = "Parcel Number not found" 

# same kind of code (all working) that gets 10 more pieces of data 

# Tax Year 
try: 
    pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator"))) 
    taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0] 
except IndexError as e: 
    s_taxes_owed_2015_yr = "No taxes due"

This code works fine on my laptop with Firefox. On PythonAnywhere, if I print the page source of the page I am trying to scrape, I get the following where my table should be:

<table border="0" cellpadding="5" cellspacing="0" class="WithBorder" width="100%"> 
<tbody><tr> 
<td id="TaxesBalancePaymentCalculator"><!--DONT_PRINT_START--> 
<span class="InputFieldTitle" id="mTabGroup_Taxes_mTaxChargesBalancePaymentInjected_mReportProcessingNote">Please wait while your current taxes are calculated.</span><img src="images/progress.gif"/> <!--DONT_PRINT_FINISH--></td> 
</tr> <!--DONT_PRINT_START--> 
<script type="text/javascript"> 
           function TaxesBalancePaymentCalculator_ScriptLoaded(pPageContent) 
           { 
            element('TaxesBalancePaymentCalculator').innerHTML = pPageContent; 
           } 
           function results_ready() 
           { 
            element('pay_button_area').style.display = 'block'; 
            element('pay_button_area2').style.display = 'block'; 
            element('pay_additional_things_area').style.display = 'block'; 
           } 
           var no_taxes_calculator = '&amp;nbsp;&lt;' + 'span class="MessageTitle"&gt;The tax balance calculator is not available.&lt;' + '/span&gt;'; 
           function no_taxes_calculator_available() 
           { 
            element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator; 
           } 
           function invalid() 
           { 
            element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator; 
           } 
           loadScript('injected/TaxesBalancePaymentCalculator.aspx?parcel_number=15-720-01-01-00-0-00-000'); 
           </script><script id="injected_taxesbalancepaymentcalculator_ScriptTag" type="text/javascript"></script> 
<tr id="pay_button_area" style="DISPLAY: none"> 
<td id="pay_button_area2"> 
<table border="0" cellpadding="2" cellspacing="0"> 
<tbody><tr>

I poked around and found that if I get the innerHTML (as a str) of:

element('TaxesBalancePaymentCalculator').innerHTML = pPageContent;

that part has my data. The problem is that I can't perform a findAll on a string, and I need certain rows from the table:

taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

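I need help getting the element as an object (rather than a string) so that I can use it to extract my data. I have tried so many things that I can't list them all here. I would really appreciate some help.

Thanks in advance.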


2015-12-15 Raymond

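I don't remember any 'findAll' method in Python itself. It is a 'bs4' method... did you import 'bs4' in your code? What are you trying to do with 'bsObj'? – Andersson

Yes, it is a bs4 method, and I have imported bs4 a few hundred lines higher up. I am trying to get information from a table inside the innerHTML. – Raymond

According to the documentation, driver.get_attribute returns a string, hence the error. – Steve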

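As @Steve pointed out in the comments, get_attribute returns a string, not an HTML element. Try replacing this line with one of the find_element_by_* methods. You can read more in the documentation: http://selenium-python.readthedocs.org/api.html#selenium.webdriver.remote.webelement.WebElement.find_element_by_tag_name

Apart from that, you are using BeautifulSoup the wrong way. You need to create the BS4 object by passing the HTML as an argument, and then call findAll on that object: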

soup = BeautifulSoup(html_as_plain_text, "html.parser") 
for element in soup.findAll(id="mGrid_RealDataGrid"): 
    # do your thing, e.g. inspect the element's text 
    print(element.get_text())
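
To make this concrete, here is a minimal sketch, assuming the injected HTML has already been obtained as a string. The sample markup and the helper name `cell_text` are made up for illustration; only the id `mGrid_RealDataGrid` and the "No taxes due" fallback come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the innerHTML string returned by Selenium.
html_as_plain_text = """
<table id="mGrid_RealDataGrid"><tr><th>Owner</th></tr><tr><td>Example owner</td></tr></table>
<table id="mGrid_RealDataGrid"><tr><th>Tax Year</th></tr><tr><td>2015</td></tr></table>
"""


def cell_text(soup, grid_index, row_index, col_index, default="No taxes due"):
    """Return the text of one table cell, or `default` if grid, row or cell is missing."""
    try:
        grid = soup.find_all(id="mGrid_RealDataGrid")[grid_index]
        cell = grid.find_all("tr")[row_index].find_all("td")[col_index]
    except IndexError:
        return default
    return cell.get_text(strip=True)


# Parsing the string is what turns it into an object that supports find_all.
soup = BeautifulSoup(html_as_plain_text, "html.parser")

tax_year = cell_text(soup, 1, 1, 0)   # second grid, second row, first cell
missing = cell_text(soup, 5, 1, 0)    # there is no sixth grid, so the default is used
print(tax_year)
print(missing)
```

Wrapping the chained indexing in one helper keeps the IndexError handling in a single place, which may be convenient when the same lookup is repeated for several cells.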


2015-12-15 15:24:39

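From what I see in the code, you want to get an element and feed its innerHTML to BeautifulSoup for further parsing. First, you probably need outerHTML to get the HTML of the element itself and, most importantly, you need to initialize the "soup" object: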

from bs4 import BeautifulSoup 

demo_div = driver.find_element_by_id('TaxesBalancePaymentCalculator') 
demo_html = demo_div.get_attribute('outerHTML') 

soup = BeautifulSoup(demo_html, "html.parser") # < YOU ARE MISSING THIS PART 
s_taxes_owed_2015_yr = soup.find_all(id="mGrid_RealDataGrid")[1].find_all('tr')[1].find_all('td')[0].get_text() 
print(s_taxes_owed_2015_yr)


2015-12-15 15:52:04 alecxe

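OK, but I still get an index out of range error, because the table never loads in the Firefox browser on PythonAnywhere. – Raymond

@Raymond, that is a separate problem. Let's avoid solving multiple problems in a single thread. If you need help, please consider creating a separate question with the details. Thanks. – alecxe

I think this may come down to a difference in page load speed. At the start of your code, you have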

pageSource = driver.page_source 
bsObj = BeautifulSoup(pageSource)

So you create your BeautifulSoup object from the page content as it is at that moment. Later, you do this:

pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator"))) 
taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

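So you tell the webdriver to wait until something has appeared, and then you query the BeautifulSoup object you created earlier. But that BeautifulSoup object still holds the page source from the start of the script, not the new page source that contains the element you waited for.

Try re-creating bsObj from the new page source after the wait has finished.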
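
The effect can be reproduced without a browser. The sketch below is only an illustration: `FakeDriver` and `finish_loading` are hypothetical stand-ins for the real Selenium driver and for the `WebDriverWait(...).until(...)` call, and the markup is a simplified guess at what the page looks like before and after the injected script runs.

```python
from bs4 import BeautifulSoup


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver whose page changes after load."""

    def __init__(self):
        self._loaded = False

    def finish_loading(self):
        # Stands in for the moment the injected script has filled the cell.
        self._loaded = True

    @property
    def page_source(self):
        if self._loaded:
            inner = '<table id="mGrid_RealDataGrid"><tr><td>2015</td></tr></table>'
        else:
            inner = "Please wait while your current taxes are calculated."
        return (
            '<table><tr><td id="TaxesBalancePaymentCalculator">'
            + inner
            + "</td></tr></table>"
        )


driver = FakeDriver()

# Snapshot taken too early: it is a copy of the HTML at this instant.
stale_soup = BeautifulSoup(driver.page_source, "html.parser")

driver.finish_loading()  # in real code: WebDriverWait(driver, 20).until(...)

# Snapshot taken after the wait: re-read page_source and parse it again.
fresh_soup = BeautifulSoup(driver.page_source, "html.parser")

stale_grid = stale_soup.find(id="mGrid_RealDataGrid")
fresh_grid = fresh_soup.find(id="mGrid_RealDataGrid")
fresh_value = fresh_grid.find("td").get_text()

print(stale_grid)   # the old snapshot never sees the grid
print(fresh_value)
```

One further observation, offered as a guess rather than a confirmed diagnosis: in the page source shown in the question, the `TaxesBalancePaymentCalculator` cell is already present while it still shows the "Please wait" message, so waiting for that id may return before the data arrives. Waiting for an element that only exists after injection, for example `EC.presence_of_element_located((By.ID, "mGrid_RealDataGrid"))`, might be a more reliable condition.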


2015-12-15 18:06:28

Excellent, that works great. Thanks for pointing it out. – Raymond
