2017-06-08

How to automatically parse a table that spans multiple pages with Python

I want to parse a table (or several tables) that spans multiple pages. My approach below works, but it is too manual; I would like it to automatically parse the tables from the different pages and merge them into one. The number of pages may not always be the same.

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import pandas as pd 

one = "https://rittresultater.no/nb/sb_tid/923?page=0&pv2=11027&pv1=U" 
two = "https://rittresultater.no/nb/sb_tid/923?page=1&pv2=11027&pv1=U" 
three = "https://rittresultater.no/nb/sb_tid/923?page=2&pv2=11027&pv1=U" 

#parse the first page 
html = urlopen(one) 
soup = BeautifulSoup(html, "lxml") 
table = soup.find_all(class_="table-condensed") 
one = pd.read_html(str(table))[0] 

#parse the second page 
html = urlopen(two) 
soup = BeautifulSoup(html, "lxml") 
table = soup.find_all(class_="table-condensed") 
two = pd.read_html(str(table))[0] 

#parse the third page 
html = urlopen(three) 
soup = BeautifulSoup(html, "lxml") 
table = soup.find_all(class_="table-condensed") 
three = pd.read_html(str(table))[0] 

df = pd.concat([one,two,three], axis = 0) 
df 

Note that the URLs differ only in "page=X". Also, the pages themselves contain links, e.g. to the next page.
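Since only the page number varies, the per-page URL can be generated rather than hard-coded. A minimal sketch, assuming the fixed query parameters from the three URLs above (the helper name `page_url` is my own, not from the question):

```python
from urllib.parse import urlencode

BASE = "https://rittresultater.no/nb/sb_tid/923"

def page_url(page_num):
    # only "page" varies between the pages; pv2 and pv1 stay fixed
    query = {"page": page_num, "pv2": 11027, "pv1": "U"}
    return f"{BASE}?{urlencode(query)}"
```

Each of the three hard-coded URLs in the question is then just `page_url(0)`, `page_url(1)`, and `page_url(2)`.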

Answers

results = {} 
for page_num in range(1, 10): #change depending on max page 
    address = 'https://rittresultater.no/nb/sb_tid/923?page=' + \ 
       str(page_num) + '&pv2=11027&pv1=U' 

    html = urlopen(address) 
    soup = BeautifulSoup(html, 'lxml') 
    table = soup.find_all(class_='table-condensed') 
    output = pd.read_html(str(table))[0] 
    results[page_num] = output 

Then do whatever you want with the output using a list comprehension, as in the last line of your code, but scaled up:

df = pd.concat([v for v in results.values()], axis = 0) 
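Since the page count may vary, the hard-coded `range(1, 10)` can be replaced by a loop that stops as soon as a page yields no table. A minimal sketch of that stopping pattern, where `fetch_tables` is a hypothetical stand-in for the real urlopen/BeautifulSoup/`pd.read_html` steps (both function names are my own, not from the answer):

```python
# Open-ended paging: request page 0, 1, 2, ... and stop at the
# first page that contains no table.
def fetch_tables(page_num, pages):
    # stand-in for fetching page=page_num and parsing its tables;
    # here we simply look them up in a dict keyed by page number
    return pages.get(page_num, [])

def scrape_all(pages):
    results = []
    page_num = 0
    while True:
        tables = fetch_tables(page_num, pages)
        if not tables:   # no table on this page -> past the last page
            break
        results.extend(tables)
        page_num += 1
    return results
```

In the real scraper, `fetch_tables` would fetch `page_url(page_num)` and return `pd.read_html(...)`, and the collected frames would go through `pd.concat` as above.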

Perfect! Thanks for the clean, nice answer – NRVA
