Python string range (parsing HTML)

In Python I am scraping a web page and would like to get all occurrences of <a href=

I am using urllib2 and my setup is as follows:

import urllib2 
response = urllib2.urlopen("http://python.org") 
html = response.read()

What would be the best way to handle this task? How would I select a range of string text from a variable that holds the entire web page?

2011-07-03 nobody

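+1 for *not* mentioning regular expressions :-) – Johnsyweb

Is Beautiful Soup also able to find email addresses, phone numbers, etc.? – nobody

Uh-oh! What are you trying to achieve? – Johnsyweb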

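For parsing HTML in Python, I prefer BeautifulSoup. This assumes you want to find the links themselves, and not just the literal text <a href=, which you could easily find with a string search.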
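For the literal-text case, a plain string search with slicing is enough. The sketch below assumes the page source is already in a string; the sample markup and the helper name `find_all_offsets` are made up for illustration.

```python
def find_all_offsets(text, needle):
    """Return the start offset of every occurrence of needle in text."""
    offsets = []
    start = text.find(needle)
    while start != -1:
        offsets.append(start)
        # Resume the search just past the current hit.
        start = text.find(needle, start + 1)
    return offsets


# Hypothetical sample standing in for the downloaded page.
html = '<p><a href="/about">About</a> and <a href="/faq">FAQ</a></p>'

offsets = find_all_offsets(html, '<a href=')
print(offsets)

# Slicing selects a range of characters: from each hit up to the closing '>'.
tags = []
for start in offsets:
    end = html.find('>', start)
    tags.append(html[start:end + 1])
print(tags)
```

This only finds tags written exactly as `<a href=`; a link with another attribute before `href` would be missed, which is one reason a parser is the safer choice.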

2011-07-03 00:17:47

Thanks, this will make it much easier ;) – nobody

Wow; BeautifulSoup is amazing – nobody

No problem. Glad to help :) –

Sounds like you need an HTML parser. Have a look at Beautiful Soup. I wouldn't use regular expressions; they get very messy and are error-prone.

2011-07-03 00:18:59 Mike

Thanks, this will make it much easier ;) – nobody

You can, for example, use a regular expression to match HTML links, or subclass Python's built-in SGML parser:

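# Note: sgmllib exists only in Python 2; it was removed in Python 3.
from sgmllib import SGMLParser


class URLExtractor(SGMLParser):
    def reset(self):
        # Called by the constructor; initialise the result list here.
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Invoked for each opening <a> tag; attrs is a list of (name, value) pairs.
        for name, value in attrs:
            if name == 'href':
                self.urls.append(value)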

You would use it like this:

extractor = URLExtractor() 
extractor.feed(html) 
print extractor.urls
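Since `sgmllib` is not available in Python 3, a rough equivalent can be sketched with the standard library's `html.parser`. This is not part of the original answer; the class name `LinkExtractor` and the sample markup are illustrative assumptions.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


# Hypothetical sample standing in for the downloaded page.
sample = (
    '<p><a href="/about">About</a> '
    '<a class="x" href="http://python.org">Python</a></p>'
)

extractor = LinkExtractor()
extractor.feed(sample)
print(extractor.urls)
```

In Python 3, `urllib.request.urlopen(...).read()` returns bytes, so the page would need to be decoded to a string before being passed to `feed`.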

2011-07-03 00:19:52 jena

This is a job for Beautiful Soup, of course:

>>> from BeautifulSoup import BeautifulSoup 
>>> import urllib2 
>>> page = urllib2.urlopen('http://stackoverflow.com/') 
>>> soup = BeautifulSoup(page) 
>>> links = soup.html.body.findAll('a', limit=10) 
>>> for i, link in enumerate(links): 
...  print i, ':', link.text, ' -- ', link['href'] 
... 
0 : Stack Exchange -- http://stackexchange.com 
1 : log in -- /users/login 
2 : blog -- http://blog.stackoverflow.com 
3 : careers -- http://careers.stackoverflow.com 
4 : chat -- http://chat.stackoverflow.com 
5 : meta -- http://meta.stackoverflow.com 
6 : about -- /about 
7 : faq -- /faq 
8 : Stack Overflow -- /
9 : Questions -- /questions

There are a lot of links on that front page; I've limited the output to the first ten!

2011-07-03 00:25:00 Johnsyweb

Another +1 for Beautiful Soup. That said, if you really want a simple parser, you can use a regular expression search.

>>> import urllib2 
>>> response = urllib2.urlopen("http://python.org") 
>>> html = response.read() 

>>> import re 
>>> re.findall("<a[^>]*href=[^>]*>", html)

Note: updated the regular expression to be more precise, based on the comments.
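To see what the pattern above returns, and one place where it over-matches, here is a small sketch run against a made-up snippet rather than the live page. The stricter variant is an illustrative assumption, not something from the answer.

```python
import re

# Pattern from the answer above.
LINK_TAG = re.compile(r"<a[^>]*href=[^>]*>")

# Hypothetical sample markup, not taken from python.org.
sample = (
    '<a href="/about">About</a>'
    '<a class="nav" href="/faq">FAQ</a>'
    '<area href="/map">'
)

matches = LINK_TAG.findall(sample)
for tag in matches:
    print(tag)

# The last hit is an <area> tag: "<a" followed by "[^>]*" also matches
# other tag names that start with "a". Requiring whitespace after "<a"
# narrows this, though it still ignores cases such as uppercase tags.
STRICTER = re.compile(r"<a\s[^>]*href=[^>]*>")
stricter_matches = STRICTER.findall(sample)
print(stricter_matches)
```

Edge cases like this are why the other answers recommend a real HTML parser over a regular expression.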

2011-07-03 00:32:02 shreddd

Links such as '? – Johnsyweb

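Sure, the example above was mostly a first pass, and I'm sure I missed some edge cases. You could do re.findall("<a[^>]*href=[^>]*>", html) to be more accurate. Then again, Beautiful Soup is probably the better solution anyway. – shreddd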
