Python: extract HTML tag attributes without regular expressions

Is there a way to extract HTML tag attributes using urllib, urllib2, or BeautifulSoup?

For example, given:

<a href="xyz" title="xyz">xyz</a>

I want to get href=xyz, title=xyz

There is another thread that discusses doing this with regular expressions, but I would like to avoid them.

Thanks

Source

2011-08-21 daydreamer

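What you are asking is covered very thoroughly in the BeautifulSoup documentation. If you have run into a specific problem, you need to be more specific in the question. –

Possible duplicate of [How do I iterate over the HTML attributes of a Beautiful Soup element?](http://stackoverflow.com/questions/822571/how-do-i-iterate-over-the-html-attributes-of-a-beautiful-soup-element) – agf

You can parse the HTML with BeautifulSoup and, for each <a> tag, read its attributes from tag.attrs: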

In [111]: soup = BeautifulSoup.BeautifulSoup('<a href="xyz" title="xyz">xyz</a>') 

In [112]: [tag.attrs for tag in soup.findAll('a')] 
Out[112]: [[(u'href', u'xyz'), (u'title', u'xyz')]]

Source

2011-08-21 22:04:50 unutbu

Why not try the HTMLParser module?

Something like this:

import HTMLParser  # Python 2 module; on Python 3 it is html.parser
import urllib

class parseTitle(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for this start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value  # or the code you need
                if name == 'title':
                    print value  # or the code you need


aparser = parseTitle()
u = urllib.urlopen('http://stackoverflow.com')  # change the address as you like
aparser.feed(u.read())
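
For readers on Python 3, the same idea can be sketched with html.parser from the standard library. This is only an illustration and not the answerer's code: the names AttrCollector and extract_attrs are made up here, and the sketch parses a literal string instead of fetching a URL, collecting each matching tag's attributes into a dict rather than printing them.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collect attributes of every start tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.found = []  # one dict of attributes per matching tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names are lowercased by the parser before they get here.
        if tag == self.tag_name:
            self.found.append(dict(attrs))


def extract_attrs(html, tag_name="a"):
    """Return a list of attribute dicts, one per <tag_name> start tag in html."""
    parser = AttrCollector(tag_name)
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    print(extract_attrs('<a href="xyz" title="xyz">xyz</a>'))
```

Calling extract_attrs on the question's sample markup should yield a single dict with the href and title values, which covers the case asked about without any regular expressions.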

Source

2011-08-22 19:25:47
