2017-04-09 45 views
1

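BeautifulSoup: get all links after a given tag

I am trying to use BeautifulSoup to scrape the following pages (for example 1, 2) to get a list of actions for traveling from one place in Bangkok to another.

I can already query the page and select the route description as follows.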

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en' 
route_request = requests.get(url) 
soup_route = BeautifulSoup(route_request.content, 'lxml') 
descriptions = soup_route.find('div', attrs={'id': 'routeDescription'}) 

The HTML of descriptions looks like this:

<div id="routeDescription"> 
... 
<br/> 
<img src="/images/walk_icon_small.PNG" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Walk by foot to <b>Sanam Luang</b> 
<br/> 
<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/> 
... 
</div> 

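What I want is a list of actions, each with the bus lines that can be taken to the next location (the question was updated after an answer, but it is still not solved).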

route_descriptions = []
for description in descriptions.find_all('img'):
    action = description.next_sibling
    to_station = action.next_sibling
    n = action.find_next_siblings('a')
    if 'travel' in action.lower():
        lines = [to_station.find_next('b').text] + [a.contents[0] for a in n]
    else:
        lines = []
    desp = {'action': action,
            'to': to_station.text,
            'lines': lines}
    route_descriptions.append(desp)

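However, I don't know how to loop over only the links that follow each action (each Travel to action) and append them to my list. I tried find_next('a') and find_next_siblings('a'), but neither did what I need.

Output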

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat', '40', '48', '501', '508'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 
    'lines': ['16', '40', '48', '501', '508'], 
    'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}] 

Desired output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
... 

Answers

1

The following should work:

from bs4 import BeautifulSoup 
import requests 
import pprint 

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en' 
route_request = requests.get(url) 
soup_route = BeautifulSoup(route_request.content, 'lxml') 
routes = soup_route.find('div', attrs={'id': 'routeDescription'}) 

parsed_routes = list() 
for img in routes.find_all('img'):
    action = img.next_sibling
    to_station = action.next_sibling

    # Collect the links that follow this image, stopping at the next image.
    links = list()
    for sibling in img.next_siblings:
        if sibling.name == 'a':
            links.append(sibling)
        elif sibling.name == 'img':
            break

    lines = list()
    if 'travel' in action.lower():
        lines.extend([to_station.find_next('b').text])
        lines.extend([link.contents[0] for link in links])

    parsed_route = {'action': action, 'to': to_station.text, 'lines': lines}
    parsed_routes.append(parsed_route)

pprint.pprint(parsed_routes) 

This outputs:

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 'lines': ['16'], 'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}] 

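Your key problem is n = action.find_next_siblings('a'), because it returns every link at the same level that comes after your "current" image. Since all images and all links are at the same level, that is not what you want.

You were probably thinking of each image as the parent node of its links. Something like: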

  • img1
    • link1
  • img2
    • link2
  • img3
    • link3
    • link4
    • link5

In reality, however, the structure is flat, more like this:

  • img1
  • link1
  • img2
  • link2
  • img3
  • link3
  • link4
  • link5

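When you ask for all images, you get img1, img2 and img3 (in this example). When you ask for all following link siblings, you get exactly that: every link that comes later at the same level. So if you are at img2 and ask for the following link siblings, you get all of them, i.e.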

  • img1
  • link1
  • img2 < you are here, and you get...
  • link2 < this
  • img3 - (not this, because it is not a link)
  • link3 < this,
  • link4 < this,
  • link5 < and this

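I hope that explains it. The only real change I made is to loop over the siblings until the next image is found and stop there, so your outer loop over the images picks up from that point. I also cleaned up the code a little, just for clarity.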
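To make the difference concrete, here is a minimal sketch on a small inline snippet. The HTML, the station names A, B, C and the helper links_until_next_img are made up for illustration (they are not from the real page), and the built-in 'html.parser' is used instead of 'lxml' so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, simplified markup: images, text and links are all siblings.
html = (
    '<div id="routeDescription">'
    '<img src="walk.png"/>Walk by foot to <b>A</b><br/>'
    '<img src="bus.png"/>Travel to <b>B</b> using the line(s): '
    '<a href="l/1">1</a> or <a href="l/2">2</a><br/>'
    '<img src="bus.png"/>Travel to <b>C</b> using the line(s): '
    '<a href="l/3">3</a><br/>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
images = soup.find('div', id='routeDescription').find_all('img')

# find_next_siblings('a') does not stop at the next image: starting from the
# second image it also returns the link that belongs to the third one.
unbounded = [a.get_text() for a in images[1].find_next_siblings('a')]


def links_until_next_img(img):
    """Collect the text of sibling <a> tags up to (not including) the next <img>."""
    links = []
    for sibling in img.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip plain text nodes such as ' or '
        if sibling.name == 'img':
            break  # the next step starts here
        if sibling.name == 'a':
            links.append(sibling.get_text())
    return links


bounded = [links_until_next_img(img) for img in images]

print(unbounded)  # ['1', '2', '3']
print(bounded)    # [[], ['1', '2'], ['3']]
```

The bounded version treats each image as the start of a group, which is the mental model from the nested list above, even though the document itself is flat.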

+0

Thanks Andre! The solution works for me, and thanks for the nice explanation too. I have accepted the answer (and upvoted it)! – titipata

0

You can try find_next_siblings (using Python 2.7):

import bs4 

text = '''<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>x`x''' 

soup = bs4.BeautifulSoup(text, 'lxml') 
img = soup.find('img') 
action = img.next_sibling 
to_station = action.next_sibling 
n = to_station.find_next_siblings('a') 
d = { 
    'action': action, 
    'to': to_station.text, 
    'buses': [a.contents[0] for a in n] 
} 

Result:

{'action': u'Travel to ', 'to': u'Khok Wua', 'buses': [u'15', u'44', u'47', u'59', u'201', u'203', u'512']} 
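Note that the result above is missing line 2: its link is wrapped in a <b> tag, so it is not a sibling of to_station and find_next_siblings('a') skips it. A possible sketch that also picks up links nested one level down and stops at the next image is shown below. It is only an illustration under assumptions: the markup is a shortened copy of the snippet from the question followed by a second step, parse_step is a hypothetical helper, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened version of the snippet from the question, followed by a second step.
html = (
    '<img src="/images/bus_icon_semi_small.gif"/>Travel to <b>Khok Wua</b>'
    ' using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b>'
    ' or <a href="lines/bangkok-bus-line/15">15</a>'
    ' or <a href="lines/bangkok-bus-line/44">44</a><br/>'
    '<img src="/images/walk_icon_small.PNG"/>Walk by foot to <b>Sanam Luang</b><br/>'
)


def parse_step(img):
    """Parse one step: the text after <img>, the station in <b>, then its links."""
    action = img.next_sibling         # text node, e.g. 'Travel to '
    to_station = action.next_sibling  # the <b> tag holding the station name
    lines = []
    for sibling in to_station.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip text such as ' or '
        if sibling.name == 'img':
            break  # next step reached
        if sibling.name == 'a':
            lines.append(sibling.get_text())
        else:
            # Also pick up links nested inside another tag, e.g. <b><a>2</a></b>.
            lines.extend(a.get_text() for a in sibling.find_all('a'))
    return {'action': str(action), 'to': to_station.get_text(), 'lines': lines}


soup = BeautifulSoup(html, 'html.parser')
steps = [parse_step(img) for img in soup.find_all('img')]
print(steps)
```

This only collects lines that are links; a line name that appears as plain bold text without an <a> tag would need separate handling.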
+0

Hi Yohanes, I tried it, but it does not work for my particular problem. Do you have a solution that works for the full HTML given above? – titipata
