2012-07-07 46 views

How do I get parent and nested values using BeautifulSoup?

I'm using BeautifulSoup to extract categories and subcategories from an HTML page. The HTML looks like this:

<a class='menuitem submenuheader' href='#'>Beverages</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=053&catid=055'>Juice</a></li></ul></div><a class='menuitem submenuheader' href='#'>DIY</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=007&catid=052'>Miscellaneous</a></li><li><a href='productlist.aspx?parentid=007&catid=047'>Sockets</a></li><li><a href='productlist.aspx?parentid=007&catid=046'>Spanners</a></li><li><a href='productlist.aspx?parentid=007&catid=045'>Tool Boxes</a></li></ul></div><a class='menuitem submenuheader' href='#'>Electronics</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=003&catid=019'>Audio/Video</a></li><li><a href='productlist.aspx?parentid=003&catid=027'>Cameras</a></li><li><a href='productlist.aspx?parentid=003&catid=023'>Cookers</a></li><li><a href='productlist.aspx?parentid=003&catid=024'>Freezers</a></li><li><a href='productlist.aspx?parentid=003&catid=025'>Kitchen Appliances</a></li><li><a href='productlist.aspx?parentid=003&catid=048'>Measuring Instruments</a></li><li><a href='productlist.aspx?parentid=003&catid=020'>Microwaves</a></li><li><a href='productlist.aspx?parentid=003&catid=050'>Miscellaneous</a></li><li><a href='productlist.aspx?parentid=003&catid=026'>Personal Care</a></li><li><a href='productlist.aspx?parentid=003&catid=021'>Refrigerators</a></li><li><a href='productlist.aspx?parentid=003&catid=018'>TV</a></li><li><a href='productlist.aspx?parentid=003&catid=022'>Washers/Dryers/Vacuum Cleaners</a></li></ul></div> 

Here Beverages is a category and Juice is a subcategory.

I have the following working code to extract the categories:

from bs4 import BeautifulSoup 
import re 
import urllib2 


url = "http://www.myprod.com" 

def main(): 
    response = urllib2.urlopen(url) 
    html = response.read() 

    soup = BeautifulSoup(html) 
    categories = soup.findAll("a", {"class" :'menuitem submenuheader'}) 
    for cat in categories: 
        print cat.contents[0] 

How would I get the subcategories in this format?

[Beverages = Category] 
[Juice = Sub] 
[DIY = Category] 
[Miscellaneous = Sub] 
[Spanners = Sub] 
[Sockets = Sub] 
[Electronics] 
[Audio = Sub] 
[Cameras] 

Answers

For each category element, you have to find the next element, and from there find its li elements:

print cat.findNext().findAll('li') 
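
The approach above can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: it assumes Python 3 with the bs4 package and the built-in "html.parser", uses a shortened copy of the question's markup, and the helper name format_menu is hypothetical. It pairs each category link with the tag that immediately follows it, so the subcategories stay grouped under their own category.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two categories only).
html = """<a class='menuitem submenuheader' href='#'>Beverages</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=053&catid=055'>Juice</a></li></ul></div><a class='menuitem submenuheader' href='#'>DIY</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=007&catid=052'>Miscellaneous</a></li><li><a href='productlist.aspx?parentid=007&catid=047'>Sockets</a></li></ul></div>"""


def format_menu(html):
    """Return lines such as '[Beverages = Category]' followed by its '[... = Sub]' lines."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for cat in soup.find_all("a", {"class": "menuitem submenuheader"}):
        lines.append("[%s = Category]" % cat.get_text(strip=True))
        # The next tag at the same level; True matches any tag and skips text nodes.
        nxt = cat.find_next_sibling(True)
        # Only treat it as this category's submenu if it really is a div.submenu.
        if nxt is not None and nxt.name == "div" and "submenu" in nxt.get("class", []):
            for li in nxt.find_all("li"):
                lines.append("[%s = Sub]" % li.get_text(strip=True))
    return lines


print("\n".join(format_menu(html)))
```

Checking the sibling's name and class before reading its li elements keeps a category without a submenu from picking up the items of a later category.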

This works to some extent, but it gives me all the nested values rather than the structure per category. [Category = Juice] [Sub = MinuteMaid] [Category = Electronics] [Sub = Blenders] – jwesonga 2012-07-07 18:20:45

Then the page isn't very well constructed. Use 'findNext()' instead of 'findParent()'. – 2012-07-07 18:44:02

Awesome, this works! – jwesonga 2012-07-08 02:57:53

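Given that your HTML always has those divs, it may be better to return one list of categories and another of subcategories, such that cats[i] corresponds to subcats[i], or, depending on what you want, return a dictionary.

In the Python shell: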

>>> from BeautifulSoup import BeautifulSoup 
>>> html = '''<a class="menuitem submenuheader" href="#">Beverages</a> 
... <div class="submenu"> 
... <ul> 
... <li><a href="productlist.aspx?parentid=053&amp;catid=055">Juice</a></li> 
... <li><a href="productlist.aspx?parentid=053&amp;catid=055">Milk</a></li> 
... </ul> 
... </div> 
... <a class="menuitem submenuheader" href="#">DIY</a> 
... <div class="submenu"> 
... <ul> 
... <li><a href="productlist.aspx?parentid=053&amp;catid=055">Micellaneous</a></li> 
... <li><a href="productlist.aspx?parentid=053&amp;catid=055">Spanners</a></li> 
... <li><a href="productlist.aspx?parentid=053&amp;catid=055">Sockets</a></li> 
... </ul> 
... </div>''' 
>>> soup = BeautifulSoup(html) 
>>> categories = soup.findAll("a", {"class": 'menuitem submenuheader'}) 
>>> cats = [cat.text for cat in categories] 
>>> sub_menus = soup.findAll("div", {"class": "submenu"}) 
>>> subcats = [] 
>>> for menu in sub_menus: 
...  subcat = [item.text for item in menu.findAll('li')] 
...  subcats.append(subcat) 
... 
>>> print cats 
[u'Beverages', u'DIY'] 
>>> print subcats 
[[u'Juice', u'Milk'], [u'Micellaneous', u'Spanners', u'Sockets']] 
>>> cat_dict = dict(zip(cats,subcats)) 
>>> print cat_dict 
{u'Beverages': [u'Juice', u'Milk'], u'DIY': [u'Micellaneous', u'Spanners', u'Sockets']} 
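
Note that zip(cats, subcats) lines up correctly only when every category header is followed by exactly one submenu div. As a hedged alternative, the sketch below builds the dictionary one header at a time, so a header without a submenu gets an empty list instead of shifting every later pairing. It assumes Python 3 with bs4; the function name categories_to_dict and the "Empty" header in the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup; the "Empty" header with no submenu is a hypothetical edge case.
html = """
<a class="menuitem submenuheader" href="#">Beverages</a>
<div class="submenu"><ul><li><a href="#">Juice</a></li><li><a href="#">Milk</a></li></ul></div>
<a class="menuitem submenuheader" href="#">Empty</a>
<a class="menuitem submenuheader" href="#">DIY</a>
<div class="submenu"><ul><li><a href="#">Spanners</a></li></ul></div>
"""


def categories_to_dict(html):
    """Map each category name to the list of its subcategory names."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for header in soup.find_all("a", {"class": "menuitem submenuheader"}):
        subs = []
        # Look only at the tag directly after this header (text nodes are skipped).
        nxt = header.find_next_sibling(True)
        if nxt is not None and nxt.name == "div" and "submenu" in nxt.get("class", []):
            subs = [li.get_text(strip=True) for li in nxt.find_all("li")]
        result[header.get_text(strip=True)] = subs
    return result


print(categories_to_dict(html))
```

A plain dict also assumes category names are unique; if two headers share a name, the later one overwrites the earlier one.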
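
Looking at the page in question, it looks like all the news stories are in h3 tags with the class item-heading. You can use BeautifulSoup to select all the story headers, then step up one level to access the a href they are wrapped in.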

In [54]: [i.parent.attrs for i in soup.select('a > h3.item-heading')] 
Out[54]: 
[{'href': '/news/us-news/civil-rights-groups-fight-trump-s-refugee-ban-uncertainty-continues-n713811'}, 
{'href': '/news/us-news/protests-erupt-nationwide-second-day-over-trump-s-travel-ban-n713771'}, 
{'href': '/politics/politics-news/some-republicans-criticize-trump-s-immigration-order-n713826'}, 
... # trimmed for readability 
] 

I used a list comprehension, but you could split it into separate steps:

# select all `h3` tags with the matching class that are contained within an `a` link. 
# This excludes any random links elsewhere on the page. 
story_headers = soup.select('a > h3.item-heading') 

# Iterate through all the matching `h3` items and access their parent `a` tag. 
# Then, within the parent you have access to the `href` attribute. 
list_of_links = [i.parent.attrs for i in story_headers] 

# Finally, extract the links into a tidy list 
links = [i["href"] for i in list_of_links] 

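Once you have the list of links, you can iterate through it and check whether the first character is /, to match only local links and not external ones.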
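
That last filtering step might look like the sketch below. It uses only the standard library; the function name local_links is hypothetical, and the two example.com entries are invented sample data added next to two of the links shown earlier. One caveat the simple first-character check misses: a link starting with // is protocol-relative and points to another host, so the sketch excludes those too.

```python
# Sample data: two links from the output above plus two hypothetical external ones.
hrefs = [
    "/news/us-news/civil-rights-groups-fight-trump-s-refugee-ban-uncertainty-continues-n713811",
    "https://example.com/external-story",  # hypothetical absolute external link
    "//cdn.example.com/asset",  # hypothetical protocol-relative link
    "/politics/politics-news/some-republicans-criticize-trump-s-immigration-order-n713826",
]


def local_links(hrefs):
    """Keep links that start with a single '/', i.e. paths on the same site."""
    return [h for h in hrefs if h.startswith("/") and not h.startswith("//")]


for link in local_links(hrefs):
    print(link)
```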
