2017-06-23

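Pulling dynamic CDATA from a web page using Selenium

I want to download the entire HTML of each of these 54897 pages. However, Selenium does not reload the page when I click "next page", or at least it only appears to. After running the code, I realized that all 54897 files are identical: it just keeps downloading the first page. Can anyone see a solution to this problem? Here is my code: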

from bs4 import BeautifulSoup 
import time 
import progressbar 
from selenium import webdriver 
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.support.ui import Select 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox() 
url = 'https://www.parlament.ch/de/ratsbetrieb/suche-curia-vista' 
driver.get(url) 

bar = progressbar.ProgressBar() 

for i in bar(range(5489)):
    driver.switch_to.default_content()
    html = BeautifulSoup(driver.page_source, 'html5lib')

    # Save the HTML captured for this iteration as myfolder/<i>.txt
    with open('myfolder/' + str(i) + ".txt", "w") as file:
        file.write(str(html))

    time.sleep(1.5)
    driver.find_element(By.ID, 'PageLinkNext').click()

    time.sleep(0.02)

Answers

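In general, you need to read the DOM of the page as it is currently loaded in the browser.

Here I load the first page, click "next page", and fetch its DOM through the id of the body element. I ask for that element's innerHTML so that (a) I can parse it with BeautifulSoup to show that the content differs from the first page, and (b) it could be saved to a file as one of your fifty-thousand-plus documents.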

>>> import bs4
>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> driver = webdriver.Chrome()
>>> driver.get('https://www.parlament.ch/de/ratsbetrieb/suche-curia-vista')
>>> driver.find_element(By.ID, 'PageLinkNext').click()
>>> DOM = driver.execute_script('return document.getElementById("ng-app").innerHTML;')
>>> page = bs4.BeautifulSoup(DOM, 'lxml')
>>> page.find_all('h4', {'class': "ms-srch-item-area"})
[<h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173611">Interpellation - Herzog Verena</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173610">Interpellation - Tornare Manuel</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173609">Postulat - Gmür Alois</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173608">Interpellation - Reynard Mathias</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173607">Motion - FDP-Liberale Fraktion</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173606">Interpellation - Bourgeois Jacques</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173605">Motion - Gmür-Schönenberger Andrea</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173604">Motion - Fraktion BD</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173603">Postulat - Dettling Marcel</a> </h4>, <h4 class="ms-srch-item-area"> <a href="/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20173602">Postulat - Mazzone Lisa</a> </h4>]
>>> driver.quit()
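To go from a single page to the whole crawl, the same innerHTML call can be put in a loop that clicks "next page" and waits until the returned markup actually differs before saving it. The following is only a rough sketch under several assumptions: the names `save_pages`, `wait_for_change`, `click_next` and `main` are made up for illustration, the `.html` file extension and the three-page demo run are arbitrary, and "the innerHTML changed" is used as a stand-in for "the next page has finished rendering", which may not hold if the site shows an intermediate loading state.

```python
import time
from pathlib import Path

# JavaScript from the session above: return the rendered markup of the
# element whose id is "ng-app".
GET_DOM_JS = 'return document.getElementById("ng-app").innerHTML;'


def wait_for_change(driver, previous, timeout=10.0, poll=0.25):
    """Poll the live DOM until it differs from `previous`, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        current = driver.execute_script(GET_DOM_JS)
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("DOM did not change after clicking next")
        time.sleep(poll)


def save_pages(driver, n_pages, out_dir, click_next):
    """Save n_pages consecutive result pages as 0.html, 1.html, ... in out_dir.

    `click_next` is a hypothetical callback that advances the browser by one
    page; it receives the driver as its only argument.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    dom = driver.execute_script(GET_DOM_JS)
    for i in range(n_pages):
        (out / f"{i}.html").write_text(dom, encoding="utf-8")
        if i + 1 < n_pages:
            click_next(driver)
            # Do not save again until the markup is different from the
            # page that was just written.
            dom = wait_for_change(driver, dom)


def main():
    # Selenium is imported here so the helpers above stay importable
    # on machines without a browser driver installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.parlament.ch/de/ratsbetrieb/suche-curia-vista")
        save_pages(
            driver,
            n_pages=3,  # small demo value, not the full crawl
            out_dir="myfolder",
            click_next=lambda d: d.find_element(By.ID, "PageLinkNext").click(),
        )
    finally:
        driver.quit()
```

Calling `main()` would run the three-page demo against the real site. Selenium's own `WebDriverWait` could replace the hand-written polling loop; it is spelled out here only to make the comparison with the previous page explicit.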

If this works for you, please mark it as 'accepted' so that others can find it.