How do I get parent and nested values with BeautifulSoup?
I'm using BeautifulSoup to extract categories and subcategories from an HTML page. The HTML looks like this:
<a class='menuitem submenuheader' href='#'>Beverages</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=053&catid=055'>Juice</a></li></ul></div><a class='menuitem submenuheader' href='#'>DIY</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=007&catid=052'>Miscellaneous</a></li><li><a href='productlist.aspx?parentid=007&catid=047'>Sockets</a></li><li><a href='productlist.aspx?parentid=007&catid=046'>Spanners</a></li><li><a href='productlist.aspx?parentid=007&catid=045'>Tool Boxes</a></li></ul></div><a class='menuitem submenuheader' href='#'>Electronics</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=003&catid=019'>Audio/Video</a></li><li><a href='productlist.aspx?parentid=003&catid=027'>Cameras</a></li><li><a href='productlist.aspx?parentid=003&catid=023'>Cookers</a></li><li><a href='productlist.aspx?parentid=003&catid=024'>Freezers</a></li><li><a href='productlist.aspx?parentid=003&catid=025'>Kitchen Appliances</a></li><li><a href='productlist.aspx?parentid=003&catid=048'>Measuring Instruments</a></li><li><a href='productlist.aspx?parentid=003&catid=020'>Microwaves</a></li><li><a href='productlist.aspx?parentid=003&catid=050'>Miscellaneous</a></li><li><a href='productlist.aspx?parentid=003&catid=026'>Personal Care</a></li><li><a href='productlist.aspx?parentid=003&catid=021'>Refrigerators</a></li><li><a href='productlist.aspx?parentid=003&catid=018'>TV</a></li><li><a href='productlist.aspx?parentid=003&catid=022'>Washers/Dryers/Vacuum Cleaners</a></li></ul></div>
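Here Beverages is a category and Juice is one of its subcategories.
I have the following working code that extracts the categories: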
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://www.myprod.com"

def main():
    # Download the page and parse it
    response = urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    # Each top-level category is an <a class='menuitem submenuheader'>
    categories = soup.find_all("a", {"class": "menuitem submenuheader"})
    for cat in categories:
        print(cat.contents[0])

main()
How would I get the subcategories as well, in this format?
[Beverages = Category]
[Juice = Sub]
[DIY = Category]
[Miscellaneous = Sub]
[Spanners = Sub]
[Sockets = Sub]
[Electronics]
[Audio = Sub]
[Cameras]
This works to some extent, but it gives me all the nested values rather than the structure per category. [Category = Juice] [Sub = MinuteMaid] [Category = Electronics] [Sub = Blenders] – jwesonga 2012-07-07 18:20:45
Then the page isn't very well formed. Use 'findNext()' instead of 'findParent()'. – 2012-07-07 18:44:02
Awesome, this works! – jwesonga 2012-07-08 02:57:53
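The answer these comments refer to is not part of this excerpt, so the following is only a minimal sketch of the approach the comments suggest: start from each category header and walk forward to the submenu that follows it, rather than walking up from the subcategory links. It assumes Python 3 and a recent bs4, where `find_next()` is the current spelling of `findNext()`. The names `extract_menu` and `SAMPLE_HTML` are hypothetical, and the sample is a trimmed copy of the markup from the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the menu markup from the question (assumption: the real
# page follows the same header-then-submenu pattern throughout).
SAMPLE_HTML = (
    "<a class='menuitem submenuheader' href='#'>Beverages</a>"
    "<div class='submenu'><ul>"
    "<li><a href='productlist.aspx?parentid=053&catid=055'>Juice</a></li>"
    "</ul></div>"
    "<a class='menuitem submenuheader' href='#'>DIY</a>"
    "<div class='submenu'><ul>"
    "<li><a href='productlist.aspx?parentid=007&catid=052'>Miscellaneous</a></li>"
    "<li><a href='productlist.aspx?parentid=007&catid=047'>Sockets</a></li>"
    "</ul></div>"
)


def extract_menu(html):
    """Return a list of (category, [subcategories]) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    menu = []
    for header in soup.find_all("a", class_="submenuheader"):
        # The submenu <div> is not nested inside the header <a>; it comes
        # after it in document order, so search forward from the header.
        submenu = header.find_next("div", class_="submenu")
        subs = []
        if submenu is not None:
            subs = [a.get_text(strip=True) for a in submenu.find_all("a")]
        menu.append((header.get_text(strip=True), subs))
    return menu


# Print the result in the format asked for in the question.
for category, subs in extract_menu(SAMPLE_HTML):
    print("[{} = Category]".format(category))
    for sub in subs:
        print("[{} = Sub]".format(sub))
```

One caveat with this sketch: `find_next()` returns the next matching element anywhere later in the document, so a header with no submenu of its own would pick up the following category's submenu. If that can happen on the real page, checking that the match comes before the next header would be needed.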