BeautifulSoup HTML table parsing

I am trying to parse information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1

Currently I am using BeautifulSoup, and my code looks like this:

from mechanize import Browser 
from BeautifulSoup import BeautifulSoup 

mech = Browser() 

url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1" 
page = mech.open(url) 

html = page.read() 
soup = BeautifulSoup(html) 

table = soup.find("table") 

rows = table.findAll('tr')[3] 

cols = rows.findAll('td') 

roadtype = cols[0].string 
start = cols[1].string
end = cols[2].string 
condition = cols[3].string 
reason = cols[4].string 
update = cols[5].string 

entry = (roadtype, start, end, condition, reason, update) 

print entry

The problem is with the start and end columns: they just print as None.

Output:

(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

I know the values get stored in the cols list, but it seems the extra link tag is messing up the parsing. The original HTML looks like this:

<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td> 
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td> 
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td> 
<td headers="condition" class="ConditionsCellText">Moderate</td> 
<td headers="reason" class="ConditionsCellText">snow or ice</td> 
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>

So what should be printed is:

(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

Any suggestions would be appreciated. Thanks in advance.

Source

2010-01-13 Stephen Tanner

Thank you very much –

You don't have to use Beautiful Soup for this. You can use the python3 htmlparser: https://github.com/schmijos/html-table-parser-python3/blob/master/html_table_parser/parser.py – schmijos
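For readers who want to try the standard-library route mentioned in the comment above, here is a minimal sketch built on Python 3's `html.parser.HTMLParser`. It is not the code from the linked project; the class name `CellTextParser` is hypothetical, and the sample row is a shortened copy of the HTML from the question.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>, including text inside nested tags.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Called for text at any nesting depth, so text inside <a> is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


sample = """
<table><tr>
<td headers="road-type">Rt. 613N (Giles County)</td>
<td headers="start"><a href="conditions.aspx?lat=37.43036753&amp;long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="condition">Moderate</td>
</tr></table>
"""

parser = CellTextParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Because `handle_data` fires for text at any depth, the link wrapped around the start column does not hide its text, which is the behaviour the question is asking for.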

start = cols[1].find('a').string

or simply

start = cols[1].a.string

or better

start = str(cols[1].find(text=True))

and

entry = [str(col.find(text=True)) for col in cols]

Source
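The snippets above can be tried without the live page. The sketch below is an assumption-laden port to Python 3 and BeautifulSoup 4 (`bs4`), not the BeautifulSoup 3 import used in the question, so it uses `find_all` and `string=True` where the answer uses `findAll` and `text=True`. The HTML is a shortened copy of the row from the question.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td headers="road-type">Rt. 613N (Giles County)</td>
<td headers="start"><a href="#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="condition">Moderate</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cols = soup.find("tr").find_all("td")

# Reach into the nested <a> explicitly; this only works for cells with a link.
start = cols[1].a.string

# More general: take the first text node found anywhere inside each cell,
# whether or not the text is wrapped in another tag.
entry = [str(col.find(string=True)) for col in cols]

print(start)
print(entry)
```

The list comprehension is the safer choice when some cells contain links and others do not, since `cols[0].a` would be `None` for a plain-text cell.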

2010-01-13 18:56:45

I went with the str(cols ...) approach. Thanks. –


You're welcome :) If you found an answer helpful, it would be good to accept it –

I agree; @Stephon Tanner should come back and accept this answer – Neil

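I tried to reproduce your error, but the source HTML page has changed.

As for the error itself, I ran into a similar problem when trying to reproduce the example here

after changing the suggested URL to a Wikipedia table

I fixed it by moving to BeautifulSoup4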

from bs4 import BeautifulSoup

and replacing .string with .get_text()

start = cols[1].get_text()
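As a rough illustration of the `get_text()` approach applied to a whole table, here is a short sketch assuming Python 3 and BeautifulSoup 4. The data row is shortened from the question; the header row with `<th>` cells is an invented stand-in, since the original page is no longer available in that form.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th>Road</th><th>Start</th><th>Condition</th></tr>
<tr>
<td headers="road-type">Rt. 613N (Giles County)</td>
<td headers="start"><a href="#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="condition">Moderate</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

entries = []
for row in soup.find("table").find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # skip rows that contain only <th> header cells
        continue
    # get_text() concatenates all text inside the cell, nested tags included;
    # strip=True trims surrounding whitespace from each piece.
    entries.append(tuple(cell.get_text(strip=True) for cell in cells))

print(entries)
```

Unlike `.string`, `get_text()` always returns a string, so a cell with several children yields its combined text instead of `None`.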

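I could not test this with your example (as I said, I was unable to reproduce the error), but I think it may be useful for people looking for a way to solve this problem.

Source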

2014-01-18 14:05:57 evinhas
