Web scraping with Selenium, BeautifulSoup and Python

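I'm currently scraping a real-estate website that uses JavaScript for its search. My process starts on a results page that contains many href links to individual listings. I append those links to another list and then click the next button. I repeat this until the next button is no longer clickable.

My problem is that after collecting all the listings (~13000 links), the scraper doesn't move on to the second part, where it opens the links and gets the information I need. Selenium doesn't even open the first element of the link list.

Here's my code: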

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass

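After this I have another simple scraper that goes through the list of listings, opens each one in Selenium and collects the data for that listing.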

for links in houselinklist: 
    print(links) 
    newwebpage = links 
    driver.get(newwebpage) 
    html = driver.page_source 
    soup = bs.BeautifulSoup(html,'html.parser') 
    . 
    . 
    . 
    . more code here

Source

2017-07-31 bathtubandatoaster

What is the link you are scraping? – ksai

https://www.28hse.com/cn/rent/house-type-g1 – bathtubandatoaster

What error are you getting? – ksai

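The problem is that while True: creates a loop that runs forever. Your except clause contains a pass statement, which means that when an error occurs the loop simply keeps running. Once the next button is no longer clickable, wait.until raises a TimeoutException after 10 seconds, the except clause swallows it, and the loop starts over, so the code after the loop is never reached. Instead, it can be written as: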

# driver and houselinklist are assumed to be defined as in the question
wait = WebDriverWait(driver, 10)
while True:
    try:
        # raises TimeoutException if 'next' is not clickable within 10 seconds
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except Exception:
        break  # change this to exit the loop

As soon as an error occurs, the loop breaks and execution moves on to the next line of code.
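One thing to watch with the loop above: the wait for next runs before the page is scraped, so if the last results page has no clickable next link, the loop exits before collecting that page's links. Here is a minimal sketch of the same idea that scrapes first and breaks only on the "no next button" error. It runs without a browser: NextButtonTimeout, collect_links and FakeSite are hypothetical names made up for this illustration, and NextButtonTimeout stands in for Selenium's TimeoutException.

```python
class NextButtonTimeout(Exception):
    """Stand-in (assumption) for selenium.common.exceptions.TimeoutException."""


def collect_links(get_page_links, click_next):
    """Walk paginated results until click_next() signals there is no next page."""
    links = []
    while True:
        # Scrape first, so the last page is collected before the loop ends.
        links.extend(get_page_links())
        try:
            click_next()
        except NextButtonTimeout:
            break  # no clickable "next" link: pagination is finished
    return links


class FakeSite:
    """Hypothetical three-page site used only to exercise the loop."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def get_page_links(self):
        return list(self.pages[self.index])

    def click_next(self):
        if self.index + 1 >= len(self.pages):
            raise NextButtonTimeout("no next button")
        self.index += 1


site = FakeSite([["/a", "/b"], ["/c"], ["/d", "/e"]])
all_links = collect_links(site.get_page_links, site.click_next)
print(all_links)
```

With a real driver, get_page_links would parse driver.page_source with BeautifulSoup as in the question, and click_next would call wait.until(...) followed by element.click(), with the except clause catching TimeoutException. Catching only that exception means a different failure (for example, search_main_div missing from the page) shows up as a traceback instead of silently ending the loop.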

Alternatively, you can drop the while loop and simply iterate over your list of href links with a for loop:

wait = WebDriverWait(driver, 10)
hrefLinks = ['link1', 'link2', 'link3']  # ... and so on
for link in hrefLinks:
    try:
        driver.get(link)
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except Exception:
        pass  # pass on error and move on to the next href link
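In a for loop, passing on an error is safe because the loop still ends when the list runs out. But a silent pass also hides which links failed. Here is a rough sketch of one way to keep going while recording the failures. The names visit_all and fake_scrape are hypothetical, the example.com URLs are placeholders, and fake_scrape stands in for "driver.get(link), then parse the page", so the snippet runs without a browser.

```python
def visit_all(links, scrape_one):
    """Call scrape_one(link) for every link; keep going when one link fails."""
    results = {}
    failures = []
    for link in links:
        try:
            results[link] = scrape_one(link)
        except Exception as exc:  # broad on purpose: a failure skips only this link
            failures.append((link, repr(exc)))
    return results, failures


def fake_scrape(link):
    """Hypothetical stand-in for loading a listing page and extracting data."""
    if "broken" in link:
        raise ValueError("page did not load")
    return {"url": link, "title": link.rsplit("/", 1)[-1]}


results, failures = visit_all(
    [
        "https://example.com/house-1",
        "https://example.com/broken",
        "https://example.com/house-2",
    ],
    fake_scrape,
)
print(sorted(results))
print(failures)
```

With ~13000 links, a list of failed URLs makes it possible to retry just those pages afterwards instead of rerunning the whole scrape.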

Source

2017-07-31 05:29:34 DJK

Did this solve your problem? – DJK

Yo, thanks mate – bathtubandatoaster
