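I know this is an old question, but I have been working with BeautifulSoup lately and thought I would offer a solution. Hopefully the comments in my code explain each section well enough.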
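from bs4 import BeautifulSoup, SoupStrainer

# instead of parsing the entire html doc, only parse the division that
# contains ER and IC wait times
# (req is assumed to be the response object already fetched in the question)
soup = BeautifulSoup(req.text, 'lxml', parse_only=SoupStrainer(id='PageContent'))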
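Let me add that many web scraping jobs only involve extracting a small part of the html/xml response. The SoupStrainer class provides a nice way to reduce the size of the soup object and to home in on the relevant part of the html/xml document.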
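To make the size reduction concrete, here is a minimal sketch on a made-up page. The markup and the Sidebar id are hypothetical, and I use the built-in 'html.parser' instead of 'lxml' only so that the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: only the PageContent division is of interest.
html = """
<html><body>
<div id="Sidebar"><p>navigation links</p></div>
<div id="PageContent"><p>wait times go here</p></div>
</body></html>
"""

# Parse the whole document, then parse only the matching division.
full_soup = BeautifulSoup(html, "html.parser")
small_soup = BeautifulSoup(html, "html.parser",
                           parse_only=SoupStrainer(id="PageContent"))

print(len(full_soup.find_all(True)))   # every tag in the document
print(len(small_soup.find_all(True)))  # only the division and its children
print(small_soup.find(id="Sidebar"))   # None: the sidebar was filtered out
```

The strained soup holds only the matching tag and its contents, so later calls such as soup('table') search a much smaller tree.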
waittimes = []
The ER wait times are contained in an html <table>. Each <td> tag in the table contains one row of information for a facility. All 8 facilities
# isolate the ER wait times in the only <table> in the soup object
ERtimes = soup('table')
# further isolate the data by filtering on the <td> tags
tds = ERtimes[0]('td')
# in this case every other <td> tag contains data. the others contain
# formatting instructions
for td in tds[::2]:
    # extract only the text data in the <td> tag using the stripped_strings generator
    # keep only the facility name, wait time and the timestamp
    waittimes.append([text for i, text in enumerate(td.stripped_strings) if i in (0, 2, 3)])
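The index filter in the loop above is easier to follow on a single cell. This is only a sketch: the cell markup and the facility name below are invented for illustration, and the real page's layout may differ, so the positions 0, 2 and 3 are an assumption that happens to fit this made-up cell.

```python
from bs4 import BeautifulSoup

# Hypothetical cell layout; the real page's markup may differ.
cell_html = """
<table><tr><td>
  <b>Example Medical Center</b><br/>
  <span>Current wait:</span>
  <span>25 minutes</span>
  <i>as of 1/7/2016 10:38:37 AM</i>
</td></tr></table>
"""

td = BeautifulSoup(cell_html, "html.parser").find("td")

# stripped_strings yields each text fragment with surrounding whitespace removed
# and skips fragments that are whitespace only
fragments = list(td.stripped_strings)

# keep positions 0, 2 and 3: facility name, wait time and timestamp
row = [text for i, text in enumerate(fragments) if i in (0, 2, 3)]
print(row)
```

Printing fragments first is a quick way to find out which positions hold the data on a page you have not scraped before.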
The immediate care (IC) wait times are contained in a <div> tag, so the parsing differs from that of the ER times.
# isolate the IC wait times which are contained in a <div> tag
# find all of the <div> tags but keep the last one which has the data
# for all 8 facilities
ICtimes = soup('div')[-1]
# once again use the stripped_strings generator to extract the pertinent text data
# however, in this case the data from the 8 facilities are returned in a single list
longlist = list(ICtimes.stripped_strings)
# list comprehension to split the single list into groups of 3
# which are facility name, wait time and timestamp
waittimes.extend([longlist[i:i+3] for i in range(0, len(longlist), 3)])
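The grouping step can be checked in isolation. A small sketch with a hypothetical flat list (the facility names and times are placeholders, not data from the page):

```python
# Hypothetical flat list, shaped like the output of stripped_strings
longlist = ["Facility A", "30 minutes", "as of 10:44 AM",
            "Facility B", "1 hour", "as of 11:22 AM"]

# step through the list three items at a time and slice out each group
groups = [longlist[i:i + 3] for i in range(0, len(longlist), 3)]
print(groups)
```

Note that if the list length is not a multiple of 3, the last group simply comes out shorter, so a missing field on the page would shift every group after it.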
Here is what waittimes looks like:
In [162]: waittimes
Out[162]:
[[u'Alexian Brothers Medical Center',u'0 minutes',u'as of 1/7/2016 10:38:37 AM'],
[u'St. Alexius Medical Center',u'3 HOURS 40 MINS',u'as of 1/6/2016 6:51:11 PM'],
[u'Addison', u'30 minutes', u'as of 1/7/2016 10:44:44 AM'],
[u'Elk Grove', u'120 minutes', u'as of 1/7/2016 11:42:22 AM'],
[u'Bensenville', u'45 mins', u'as of 1/7/2016 10:49:05 AM'],
[u'Hanover Park', u'15 Mins', u'as of 1/7/2016 11:45:08 AM'],
[u'Mt. Prospect', u'1 hour', u'as of 1/7/2016 11:22:53 AM'],
[u'Schaumburg', u'1 hour', u'as of 1/7/2016 11:56:10 AM'],
[u'Palatine', u'30 minutes', u'as of 1/7/2016 12:06:35 PM']]
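Of course, I did not write the routines needed to clean the data. I focused on extracting the relevant data and presenting it in a structured format.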
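For what it is worth, one possible cleaning step is to normalize the wait time strings to minutes. This is only a sketch under the assumption that every value looks like the ones in the output above (a number followed by an hours or minutes unit); wait_to_minutes is a hypothetical helper of my own, not part of BeautifulSoup.

```python
import re

def wait_to_minutes(text):
    """Convert strings such as '3 HOURS 40 MINS' or '45 mins' to minutes.

    Returns None when no number-plus-unit pair is found.
    """
    total = 0
    found = False
    # match a number followed by a unit starting with "h" (hours) or "m" (minutes)
    for value, unit in re.findall(r"(\d+)\s*(h|m)", text.lower()):
        found = True
        total += int(value) * (60 if unit == "h" else 1)
    return total if found else None

# wait time strings taken from the waittimes output above
for raw in ["0 minutes", "3 HOURS 40 MINS", "45 mins", "15 Mins", "1 hour"]:
    print(raw, "->", wait_to_minutes(raw))
```

Returning None for unparseable text keeps bad rows visible instead of silently turning them into 0.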
Something like [this - BSoup select with css selectors](https://stackoverflow.com/questions/15920039/beautifulsoup-how-to-select-certain-tag) maybe? –
Hmm, not quite. I had already looked through that example before – JJThaeler