BeautifulSoup: get all links after a given tag

I am trying to use BeautifulSoup to scrape the following pages (e.g. 1, 2) to get the list of actions for traveling from one place to another in Bangkok.

Basically, I can query the page and select the route description as follows.

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en' 
route_request = requests.get(url) 
soup_route = BeautifulSoup(route_request.content, 'lxml') 
descriptions = soup_route.find('div', attrs={'id': 'routeDescription'})

The HTML of descriptions looks like this:

<div id="routeDescription"> 
... 
<br/> 
<img src="/images/walk_icon_small.PNG" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Walk by foot to <b>Sanam Luang</b> 
<br/> 
<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/> 
... 
</div>

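Basically, I am trying to get a list of actions and the bus lines used to travel to the next location (the question has been updated based on an answer, but it is still not solved).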

route_descrtions = []
for description in descriptions.find_all('img'):
    action = description.next_sibling
    to_station = action.next_sibling
    n = action.find_next_siblings('a')
    if 'travel' in action.lower():
        lines = [to_station.find_next('b').text] + [a.contents[0] for a in n]
    else:
        lines = []
    desp = {'action': action,
            'to': to_station.text,
            'lines': lines}
    route_descrtions.append(desp)

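However, I don't know how to loop through the links that follow each action (the "Travel to" action) and append them to my list. I tried find_next('a') and find_next_siblings('a'), but neither did what I need.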

Output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat', '40', '48', '501', '508'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 
    'lines': ['16', '40', '48', '501', '508'], 
    'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

Desired output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
...

Source

2017-04-09 titipata

The following should work:

from bs4 import BeautifulSoup 
import requests 
import pprint 

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en' 
route_request = requests.get(url) 
soup_route = BeautifulSoup(route_request.content, 'lxml') 
routes = soup_route.find('div', attrs={'id': 'routeDescription'}) 

parsed_routes = list() 
for img in routes.find_all('img'): 
    action = img.next_sibling 
    to_station = action.next_sibling 
    links = list() 
    for sibling in img.next_siblings:
        if sibling.name == 'a':
            links.append(sibling)
        elif sibling.name == 'img':
            break

    lines = list()
    if 'travel' in action.lower():
        lines.extend([to_station.find_next('b').text])
        lines.extend([link.contents[0] for link in links])

    parsed_route = {'action': action, 'to': to_station.text, 'lines': lines} 
    parsed_routes.append(parsed_route) 

pprint.pprint(parsed_routes)

This outputs:

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 'lines': ['16'], 'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

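Your key problem is n = action.find_next_siblings('a'), because it returns every link at the same level that comes after your "current" image. Since all the images and all the links sit at the same level, that is not what you want.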

You may be thinking of the images as parent nodes of the links. Something like:

img1
- link1
img2
- link2
img3
- link3
- link4
- link5

In reality, however, it looks more like this:

img1
link1
img2
link2
img3
link3
link4
link5

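When you ask for the images, you get img1, img2 and img3 (in this example). When you ask for all the next link siblings, you get exactly that. So if you are at img2 and ask for the next link siblings, you get all of them, i.e.

img1
link1
img2 < you are here, and you get...
link2 < this,
img3 - (not this one, because it is not a link)
link3 < this,
link4 < this,
link5 < and this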

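I hope that explains it. The only change I made is to loop over the siblings until an image is found and stop there. Your outer loop over the images then continues from that point. I also cleaned up the code a bit, just for clarity.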
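The difference can be reproduced offline on a small fragment. The following is only a minimal sketch, not part of the code above: the HTML is made-up sample data modeled on the routeDescription markup, links_until_next_img is a hypothetical helper name, and html.parser is used here only so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up sample modeled on the routeDescription markup (not the real page).
html = (
    '<div id="routeDescription">'
    '<img src="walk.png"/>Walk by foot to <b>Sanam Luang</b><br/>'
    '<img src="bus.gif"/>Travel to <b>Khok Wua</b> using the line(s): '
    '<b><a href="lines/2">2</a></b> or <a href="lines/15">15</a><br/>'
    '<img src="bus.gif"/>Travel to <b>Siam</b> using the line(s): '
    '<b><a href="lines/16">16</a></b> or <a href="lines/40">40</a><br/>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
images = soup.find('div', id='routeDescription').find_all('img')


def links_until_next_img(img):
    """Hypothetical helper: sibling <a> texts after img, up to the next <img>."""
    links = []
    for sibling in img.next_siblings:
        if sibling.name == 'img':
            break  # the next step starts here, so stop collecting
        if sibling.name == 'a':
            links.append(sibling.get_text())
    return links


second_img = images[1]
# find_next_siblings('a') keeps going past the following image.
all_following = [a.get_text() for a in second_img.find_next_siblings('a')]
# The bounded loop stops at the following image.
bounded = links_until_next_img(second_img)

print(all_following)  # ['15', '40']
print(bounded)        # ['15']
```

The first line of each step ('2' and '16') does not appear in either list because its link is nested inside a b tag and is therefore not a sibling of the image; the code above picks it up separately with to_station.find_next('b').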

Source

2017-04-09 21:37:23

Thanks, Andre! The solution works for me. Thank you for the good explanation as well. I have accepted the answer (and upvoted it)! – titipata

You can try find_next_siblings (using Python 2.7):

import bs4 

text = '''<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>'''

soup = bs4.BeautifulSoup(text, 'lxml') 
img = soup.find('img') 
action = img.next_sibling 
to_station = action.next_sibling 
n = to_station.find_next_siblings('a') 
d = { 
    'action': action, 
    'to': to_station.text, 
    'buses': [a.contents[0] for a in n] 
}

Result:

{'action': u'Travel to ', 'to': u'Khok Wua', 'buses': [u'15', u'44', u'47', u'59', u'201', u'203', u'512']}
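Note that line 2 is missing from this result: its link is wrapped in a b tag, so it is not a sibling of to_station. One possible way to pick up nested links as well is to walk next_elements (document order) instead of next_siblings and stop at the next img. This is only a sketch under assumptions: the HTML is a shortened sample with an extra made-up walking step, lines_for_step is a hypothetical name, and html.parser is used so that lxml is not needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample; the trailing walking step is added only for illustration.
html = (
    '<img src="bus.gif"/>Travel to <b>Khok Wua</b> using the line(s): '
    '<b><a href="lines/bangkok-bus-line/2">2</a></b> or '
    '<a href="lines/bangkok-bus-line/15">15</a> or '
    '<a href="lines/bangkok-bus-line/44">44</a><br/>'
    '<img src="walk.png"/>Walk by foot to <b>Sanam Luang</b>'
)


def lines_for_step(img):
    """Hypothetical helper: texts of all <a> tags between img and the next <img>."""
    lines = []
    for element in img.next_elements:
        if not isinstance(element, Tag):
            continue  # skip plain text nodes
        if element.name == 'img':
            break  # reached the next step
        if element.name == 'a':
            lines.append(element.get_text())
    return lines


soup = BeautifulSoup(html, 'html.parser')
first_img = soup.find('img')
lines = lines_for_step(first_img)
print(lines)  # ['2', '15', '44']
```

Because next_elements descends into nested tags, the link inside the b tag is found without a separate find_next('b') call.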

Source

2017-04-09 04:30:29

Hi Yohanes, I tried it, but it doesn't work for my particular problem. Do you have a solution that works for the full HTML given above? – titipata
