2016-09-30 42 views

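BeautifulSoup - misspelled class

I'm trying to scrape location data from https://www.wellstar.org/locations/pages/default.aspx. Looking at the page source, I noticed that the class for the hospital address is sometimes spelled with an extra 'd': 'adddress' as well as 'address'. Is there a way to handle this discrepancy in the code below? I tried adding an if statement that tests the length of the address object, but I only get the addresses associated with the 'adddress' class. I feel like I'm close, but I'm out of ideas.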

import urllib 
import urllib.request 
from bs4 import BeautifulSoup 
import re 

def make_soup(url): 
    thepage = urllib.request.urlopen(url) 
    soupdata = BeautifulSoup(thepage,"html.parser") 
    return soupdata 

soup = make_soup("https://www.wellstar.org/locations/pages/default.aspx") 

for table in soup.findAll("table", class_="s4-wpTopTable"):
    for type in table.findAll("h3"):
        type = type.get_text()
    for name in table.findAll("div", class_="PurpleBackgroundHeading"):
        name = name.get_text()
    address = ""
    for address in table.findAll("div", class_="WS_Location_Adddress"):
        address = address.get_text(separator=" ")
    if len(address) == 0:
        for address in table.findAll("div", class_="WS_Location_Address"):
            address = address.get_text(separator=" ")
            print(type, name, address)

Answer


BeautifulSoup is quite flexible about this; you can use a regular expression:

for address in table.find_all("div", class_=re.compile(r"WS_Location_Ad{2,}ress")): 

where d{2,} matches the letter d two or more times, so the pattern covers both 'Address' and 'Adddress'.
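
To see what that pattern accepts without fetching the page, here is a minimal sketch using only the standard re module. The list of candidate class names is made up for illustration (the one-d spelling in particular is hypothetical). BeautifulSoup tests a compiled pattern against each class with search(), so the sketch does the same.

```python
import re

# Same pattern as in the answer: "A", then two or more "d", then "ress".
pattern = re.compile(r"WS_Location_Ad{2,}ress")

# Hypothetical class names, used only to show what the pattern accepts.
candidates = [
    "WS_Location_Address",   # two d's: the correct spelling
    "WS_Location_Adddress",  # three d's: the misspelling on the page
    "WS_Location_Adress",    # one d: should not match
]

# search() looks for the pattern anywhere in the string,
# which is how BeautifulSoup applies a compiled regex to class_.
matched = [name for name in candidates if pattern.search(name)]
print(matched)
```

Because search() is unanchored, the pattern would also accept a longer class name that merely contains this text; anchoring it with ^ and $ is an option if that matters.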


Alternatively, you can specify a list of classes:

for address in table.find_all("div", class_=["WS_Location_Address", "WS_Location_Adddress"]): 
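
As a rough illustration of the list form, the sketch below runs it against a small inline HTML snippet. The markup and addresses are invented for the example and are not copied from the real page; only the class names come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the two spellings; not taken from the real site.
html = """
<table class="s4-wpTopTable">
  <tr><td>
    <div class="WS_Location_Address">1 Example St<br>Suite 100</div>
    <div class="WS_Location_Adddress">2 Sample Ave</div>
    <div class="SomethingElse">ignored</div>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A list passed to class_ matches a div carrying any one of the listed classes.
addresses = [
    div.get_text(separator=" ", strip=True)
    for div in soup.find_all(
        "div", class_=["WS_Location_Address", "WS_Location_Adddress"]
    )
]
print(addresses)
```

Both spellings are collected in a single pass, so the length check from the question is no longer needed.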

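Two great options. To be honest, I'm both curious about and intimidated by regular expressions. This might be a good reason to take some time to learn the operators. – Daniel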