Scraping tables across multiple pages with Selenium when the URL does not change

I have been trying to write a program that scrapes statistics from www.whoscored.com and builds a pandas DataFrame.

I have updated the code with crookedleaf's help, and this is the working code:

import time 
import pandas as pd 
from pandas.io.html import read_html 
from pandas import DataFrame 
from selenium import webdriver 

driver = webdriver.Firefox() 
driver.get('https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017') 

summary_stats = DataFrame() 

while True: 

    # Wait while the table is still being refreshed.
    while driver.find_element_by_xpath('//*[@id="statistics-table-summary"]').get_attribute('class') == 'is-updating':
        time.sleep(1)

    table = driver.find_element_by_xpath('//*[@id="statistics-table-summary"]') 
    table_html = table.get_attribute('innerHTML') 
    page_number = driver.find_element_by_xpath('//*[@id="currentPage"]').get_attribute('value') 
    print('Page ' + page_number) 
    df1 = read_html(table_html)[0] 
    summary_stats = pd.concat([summary_stats, df1]) 
    next_link = driver.find_element_by_xpath('//*[@id="next"]') 

    if 'disabled' in next_link.get_attribute('class'):
        break

    next_link.click() 

print(summary_stats) 

driver.close()

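Now I am trying to collect statistics from the other tabs. I am really close, but the code does not exit the loop when it should break. Here is the code: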

defensive_button = driver.find_element_by_xpath('//*[@id="stage-top-player-stats-options"]/li[2]/a') 
defensive_button.click() 

defensive_stats = DataFrame() 

while True: 

    # Wait while the table is still being refreshed.
    while driver.find_element_by_xpath('//*[@id="statistics-table-defensive"]').get_attribute('class') == 'is-updating':
        time.sleep(1)

    table = driver.find_element_by_xpath('//*[@id="statistics-table-defensive"]') 
    table_html = table.get_attribute('innerHTML') 
    page_number = driver.find_element_by_xpath('//*[@id="statistics-paging-defensive"]/div/input[1]').get_attribute('value') 
    print('Page ' + page_number) 
    df2 = read_html(table_html)[0] 
    defensive_stats = pd.concat([defensive_stats, df2]) 
    next_link = driver.find_element_by_xpath('//*[@id="statistics-paging-defensive"]/div/dl[2]/dd[3]') 

    if 'disabled' in next_link.get_attribute('class'):
        break

    next_link.click() 

print(defensive_stats)

The code iterates through all the pages, but then stays on the last page indefinitely.

Source

2017-03-06 jchadwick92

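You are defining your table code outside of your loop. You are navigating to the next page, but you are not redefining your table and table_html elements. Move them to the first lines after while True.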
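A minimal sketch of that loop shape, with the Selenium calls replaced by plain callables so the control flow can be read on its own. The names `collect_pages`, `read_table`, `is_last_page` and `go_next` are hypothetical and not part of Selenium or pandas; in the real script they would wrap `driver.find_element_by_xpath(...)` calls like the ones in the question. The `max_pages` cap is an added assumption so that a stuck 'next' button cannot loop forever.

```python
def collect_pages(read_table, is_last_page, go_next, max_pages=1000):
    """Read the table on every page, looking it up again after each navigation."""
    pages = []
    for _ in range(max_pages):  # safety cap (assumption, not in the original code)
        pages.append(read_table())  # re-read the table inside the loop, on every page
        if is_last_page():
            break
        go_next()
    return pages


# Stand-in for the browser: three "pages" of made-up table HTML.
fake_site = {
    "pages": ["<table>1</table>", "<table>2</table>", "<table>3</table>"],
    "current": 0,
}


def read_table():
    return fake_site["pages"][fake_site["current"]]


def is_last_page():
    return fake_site["current"] == len(fake_site["pages"]) - 1


def go_next():
    fake_site["current"] += 1


result = collect_pages(read_table, is_last_page, go_next)
print(result)
```

Because `read_table` is called at the top of every iteration, each page contributes its own content instead of the first page being read repeatedly.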

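Edit: After you made that change to your code, my guess is that, because the table's content is loaded dynamically, you either cannot process the change or cannot get the content because of the "loading" graphic overlay. Another thing is that there may not always be 30 pages. For example, today there are 29, so it kept fetching data from page 29. I modified the code to keep running until the "next" button is no longer enabled, and I added a wait that checks whether the table has loaded before continuing: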

import time 
from pandas.io.html import read_html 
from pandas import DataFrame 
from selenium import webdriver 

driver = webdriver.Chrome('path-to-your-chromedriver')  # placeholder: replace with the path to your chromedriver executable
driver.get('https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017') 

df = DataFrame() 

while True: 

    # Wait while the table is still being refreshed.
    while driver.find_element_by_xpath('//*[@id="statistics-table-summary"]').get_attribute('class') == 'is-updating':
        time.sleep(1)

    table = driver.find_element_by_xpath('//*[@id="statistics-table-summary"]') 
    table_html = table.get_attribute('innerHTML') 
    page_number = driver.find_element_by_xpath('//*[@id="currentPage"]').get_attribute('value') 
    print('Page ' + page_number) 
    df1 = read_html(table_html)[0] 
    df.append(df1) 
    next_link = driver.find_element_by_xpath('//*[@id="next"]') 

    if 'disabled' in next_link.get_attribute('class'):
        break

    next_link.click() 


print(df) 

driver.close()

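However, I am getting an empty DataFrame at the end of the run. Unfortunately I am not familiar enough with pandas to pin down the problem, but it has to do with df.append(). I printed the value of df1 in each loop iteration and it shows the correct data, but it is not being added to the DataFrame. This may be something you are familiar enough with to make the changes needed to get it fully running.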
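A likely explanation: `DataFrame.append` returned a new DataFrame rather than modifying `df` in place, so calling `df.append(df1)` without assigning the result leaves `df` empty. The method was later deprecated and then removed in pandas 2.0. A minimal sketch of the usual alternative follows: collect each page's frame in a plain list and call `pd.concat` once at the end. The two small frames are made-up stand-ins for what `read_html(table_html)[0]` returns on each page, and the column names are invented for illustration.

```python
import pandas as pd

# Made-up stand-ins for the per-page tables returned by read_html(table_html)[0].
page_frames = [
    pd.DataFrame({"Player": ["A", "B"], "Rating": [7.9, 7.5]}),
    pd.DataFrame({"Player": ["C"], "Rating": [7.4]}),
]

collected = []
for page_df in page_frames:
    collected.append(page_df)  # list.append mutates the list in place

# pd.concat returns a new DataFrame, so its result has to be assigned.
all_pages = pd.concat(collected, ignore_index=True)
print(all_pages)
```

Concatenating once at the end also avoids copying the accumulated data on every page.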

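Edit 2: It took me a while to figure this one out. Essentially, the page's content is loaded dynamically with JavaScript. The 'next' element you are declaring is still the first 'next' button you come across. Each time you click a new tab, the number of 'next' elements increases. I have added an edit that successfully navigates all the tabs (except the 'detailed' tab... hopefully you don't need that one lol). However, I am still getting empty DataFrame()s.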

import time 
import pandas as pd 
from pandas.io.html import read_html 
from pandas import DataFrame 
from selenium import webdriver 

driver = webdriver.Chrome('/home/mdrouin/Downloads/chromedriver') 
driver.get('https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017') 

statistics = { # one DataFrame per tab on the page, keyed by the tab name
    'summary': DataFrame(), 
    'defensive': DataFrame(), 
    'offensive': DataFrame(), 
    'passing': DataFrame() 
} 

count = 0 
tabs = driver.find_element_by_xpath('//*[@id="stage-top-player-stats-options"]').find_elements_by_tag_name('li') # this pulls all the tab elements 
for tab in tabs[:-1]: # iterate over the different tab sections 
    section = tab.text.lower() 
    driver.find_element_by_xpath('//*[@id="stage-top-player-stats-options"]').find_element_by_link_text(section.title()).click() # clicks the tab by its link text (.title() capitalizes the first character of each word)
    time.sleep(3) 
    while True:
        # Wait while this section's table is still being refreshed (the XPath is formatted per section).
        while driver.find_element_by_xpath('//*[@id="statistics-table-%s"]' % section).get_attribute('class') == 'is-updating':
            time.sleep(1)

        table = driver.find_element_by_xpath('//*[@id="statistics-table-%s"]' % section)  # XPath formatted per section
        table_html = table.get_attribute('innerHTML')
        df = read_html(table_html)[0]
        # print(df)
        pd.concat([statistics[section], df])
        next_link = driver.find_elements_by_xpath('//*[@id="next"]')[count]  # pick the 'next' button that belongs to the current tab
        if 'disabled' in next_link.get_attribute('class'):
            break
        time.sleep(5)
        next_link.click()
    count += 1


for df in statistics.values(): # iterates over the DataFrame() elements
    print(df)

driver.quit()
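Indexing the 'next' buttons with `count` works only as long as the tabs are visited in order and each one adds exactly one such element. A minimal alternative sketch, assuming (this has not been verified against the site) that the 'next' buttons belonging to inactive tabs are hidden: pick whichever candidate is currently displayed. `visible_next_button` and `FakeElement` are hypothetical names used only for illustration; in the real script the candidates would come from `driver.find_elements_by_xpath('//*[@id="next"]')`, and `is_displayed()` is the WebElement method being relied on.

```python
def visible_next_button(candidates):
    """Return the first candidate that is currently displayed, or None."""
    for element in candidates:
        if element.is_displayed():
            return element
    return None


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for demonstration only."""

    def __init__(self, name, displayed):
        self.name = name
        self._displayed = displayed

    def is_displayed(self):
        return self._displayed


# One 'next' button per tab opened so far; only the active tab's button is shown.
buttons = [FakeElement("summary-next", False), FakeElement("defensive-next", True)]
chosen = visible_next_button(buttons)
print(chosen.name)
```

Separately, the `pd.concat([statistics[section], df])` call in the loop above discards its result. Assigning it back, as in `statistics[section] = pd.concat([statistics[section], df])`, mirrors what the asker's working code does for `summary_stats`.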

Source

2017-03-06 23:26:47 crookedleaf

I have updated the code, but there is still a problem. I would appreciate it if you could take another look – jchadwick92

@jchadwick92 check the update to my answer and let me know if you have any questions – crookedleaf

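The code works great, thank you very much. I have now moved on to the defensive section, but I am having trouble exiting the loop. Could you take another look? – jchadwick92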
