Nested tags/tables in BeautifulSoup (Python)

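I've spent half a day searching Google for the right answer. The closest thing I found is this StackOverflow post: Nested tags in BeautifulSoup - Python

Essentially, I'm using BeautifulSoup in Python to pull wait-time data from a complex page with nested elements. Some of the HTML elements have a class/id, but most don't. Looking at the DOM, I can see the path to the element I want. I wrote a preliminary script that points to the right path (...I think), but the console keeps printing an empty array. Even changing the code to print something simple (such as soup.select('body h2')) prints nothing. Here is my code: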

from BeautifulSoup import BeautifulSoup 
import requests 

url = 'http://www.alexianbrothershealth.org/wait-times' 
r = requests.get(url) 

soup = BeautifulSoup(r.text) 
wait_times = soup.select('body div div div div div div div table tbody tr td') 

print wait_times

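Any ideas on what I need to change to make this work? I have more sites to scrape, so figuring out the correct syntax for the .select() path would really help. I've also tried lxml with XPath, and it printed an empty array too. The page source tells me the data is in the HTML rather than loaded by client-side JavaScript, so I should be fine.

PS: I'm a newbie, so any complicated answers will be completely lost on me ;)

Source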

2014-02-20 JJThaeler

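Something like [this - BSoup select with css selectors](https://stackoverflow.com/questions/15920039/beautifulsoup-how-to-select-certain-tag), maybe? –

Hmm, not quite. I've already looked through that example. – JJThaeler

I don't think select is the BeautifulSoup method you're looking for. select uses CSS selectors, but you're just looking for the right set of tags.

If the times you're looking for are all in <td> elements, I would use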

tds = soup.find_all("td") 
for cell in tds: 
    children = cell.findChildren() 
    # ... do actual work ...

Alternatively, if you want to use select (and you absolutely can), try dropping the whole first group of tags:

soup.select("table tbody tr td")

That works fine.

Source
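One general reason a selector copied from the browser's DOM view can come back empty, offered as a possibility rather than a diagnosis of this particular page: browsers insert a <tbody> into tables when building the DOM, while the raw HTML (and the tree that html.parser or lxml builds from it) may not contain one. The sketch below uses made-up markup and assumes the bs4 package (BeautifulSoup 4), since .select() is a BeautifulSoup 4 feature.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: note there is no <tbody> in the source.
html = """
<div id="PageContent">
  <table>
    <tr><td>Addison</td><td>30 minutes</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A selector that includes tbody finds nothing here, because the parser
# does not add the <tbody> that a browser would have inserted.
with_tbody = soup.select("table tbody tr td")

# Dropping tbody, and anchoring on the nearest element that has an id,
# matches both cells.
without_tbody = soup.select("#PageContent table tr td")

print(with_tbody)
print([td.get_text() for td in without_tbody])
```

Anchoring the selector on the closest ancestor that has an id or class usually keeps it much shorter than a full chain of div div div.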

2014-02-20 21:03:19

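I've already printed all the tds using other examples on this site. The problem isn't getting all of a particular element, it's finding the path to the specific element I want. For example, if the page had 5000 tds, each with a child, I'd have to work out which td's child I need. Maybe I'd do that by building an array and spending a long time trying to determine which child holds the right value, i.e. child[3,018]. That seems super inefficient and definitely not the coder way! Trying the .select method gives me an empty array. – JJThaeler

Does the shorter select I suggested work? It seems to work on the page you linked. –

I tried it, and I keep getting a blank array in the console. I'd add a screenshot to my question if I could... nope, I need 10 reputation points to post images. It just outputs [] – JJThaeler

I know this is an old question, but I've been working with BeautifulSoup recently and thought I'd offer a solution. Hopefully the comments in my code explain each part well enough.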

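from bs4 import BeautifulSoup, SoupStrainer
import requests
req = requests.get('http://www.alexianbrothershealth.org/wait-times')  # URL from the question
# instead of parsing the entire html doc, only parse the division that
# contains ER and IC wait times
soup = BeautifulSoup(req.text, 'lxml', parse_only=SoupStrainer(id='PageContent'))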

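Let me add that many web-scraping jobs only involve extracting a small part of the html/xml response. The SoupStrainer class provides a nice way to reduce the size of the soup object and drill down to the relevant part of the html/xml document.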
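To see what parse_only does without hitting the network, here is a minimal sketch on a made-up page. Only the id 'PageContent' comes from the code above; the rest of the markup is an assumption for illustration, and html.parser is used so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up page for illustration; only the id 'PageContent' comes from the answer.
html = """
<html><body>
  <h1>Site header</h1>
  <div id="PageContent">
    <table><tr><td>Addison</td></tr></table>
  </div>
  <div id="Footer"><p>Contact us</p></div>
</body></html>
"""

# Keep only the element whose id is 'PageContent' (and everything inside it).
only_content = SoupStrainer(id="PageContent")
soup = BeautifulSoup(html, "html.parser", parse_only=only_content)

# Elements outside the strained division never make it into the soup.
print(soup.find("h1"))
print(soup.find(id="Footer"))
print([td.get_text() for td in soup.find_all("td")])
```

Because the rest of the page is never turned into objects, later calls such as soup('table') only have the relevant division to search.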

waittimes = []

The ER wait times are contained in an html <table>. Each <td> tag in the table contains one row of information for a facility. All 8 facilities

# isolate the ER wait times in the only <table> in the soup object 
ERtimes = soup('table') 
# further isolate the data by filtering on the <td> tags 
tds = ERtimes[0]('td') 

# in this case every other <td> tag contains data. the others contain 
# formatting instructions 
for td in tds[::2]:
    # extract only the text data in the <td> tag using stripped_strings generator
    # keep only the facility name, wait time and the timestamp
    waittimes.append([text for i, text in enumerate(td.stripped_strings) if i in [0, 2, 3]])
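The index filter in that loop is easier to follow on a single cell. The markup below is hypothetical (the real page's cell layout may differ); it only shows how stripped_strings yields one whitespace-trimmed string per text node and how enumerate picks positions 0, 2 and 3.

```python
from bs4 import BeautifulSoup

# Hypothetical cell layout; the real page's markup may differ.
html = (
    "<table><tr><td><b>Addison</b><br/>Wait time:<br/>"
    " 30 minutes <br/>as of 1/7/2016 10:44:44 AM</td></tr></table>"
)
td = BeautifulSoup(html, "html.parser").find("td")

# Every text node in the cell, with surrounding whitespace removed.
all_strings = list(td.stripped_strings)

# Keep the facility name (0), the wait time (2) and the timestamp (3);
# position 1 is the "Wait time:" label in this made-up layout.
kept = [text for i, text in enumerate(td.stripped_strings) if i in (0, 2, 3)]

print(all_strings)
print(kept)
```

Printing all_strings first is a quick way to find out which positions hold the values you want on a new page.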

The Immediate Care wait times are contained in a <div> tag, so parsing them is handled differently from the ER times.

# isolate the IC wait times which are contained in a <div> tag 
# find all of the <div> tags but keep the last one which has the data 
# for all 8 facilities 
ICtimes = soup('div')[-1] 

# once again use the stripped_strings generator to extract the pertinent text data
# however, in this case the data from the 8 facilities are returned into a single list
longlist = list(ICtimes.stripped_strings)
# list comprehension to return the data from the single list into groups of 3
# which are facility name, wait time and timestamp
waittimes.extend([longlist[i:i+3] for i in range(0, len(longlist), 3)])
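The grouping step is plain Python and can be tried on its own. This sketch wraps the same slicing idea in a helper called chunk (a name made up here) and runs it on a short list taken from the sample values in this answer.

```python
def chunk(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Flat list in the order name, wait time, timestamp, repeated per facility.
flat = [
    "Addison", "30 minutes", "as of 1/7/2016 10:44:44 AM",
    "Elk Grove", "120 minutes", "as of 1/7/2016 11:42:22 AM",
]

groups = chunk(flat, 3)
print(groups)

# If the length is not a multiple of the size, the last group is shorter.
ragged = chunk(["a", "b", "c", "d"], 3)
print(ragged)
```

The shorter last group is worth checking for: if a facility is missing one of its three strings, every group after it will be shifted.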

Here is what waittimes looks like:

In [162]: waittimes 
Out[162]: 
[[u'Alexian Brothers Medical Center',u'0 minutes',u'as of 1/7/2016 10:38:37 AM'], 
[u'St. Alexius Medical Center',u'3 HOURS 40 MINS',u'as of 1/6/2016 6:51:11 PM'], 
[u'Addison', u'30 minutes', u'as of 1/7/2016 10:44:44 AM'], 
[u'Elk Grove', u'120 minutes', u'as of 1/7/2016 11:42:22 AM'], 
[u'Bensenville', u'45 mins', u'as of 1/7/2016 10:49:05 AM'], 
[u'Hanover Park', u'15 Mins', u'as of 1/7/2016 11:45:08 AM'], 
[u'Mt. Prospect', u'1 hour', u'as of 1/7/2016 11:22:53 AM'], 
[u'Schaumburg', u'1 hour', u'as of 1/7/2016 11:56:10 AM'], 
[u'Palatine', u'30 minutes', u'as of 1/7/2016 12:06:35 PM']]

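Of course, I haven't written the routines needed to clean the data. I focused on extracting the relevant data and presenting it in a structured format.

Source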

2016-01-07 20:12:35 floydn
