Parsing HTML with regular expressions in Python 2.7 - I really don't get it

Sorry if this is a dumb question, but I really need help with Python.

['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

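So I have a list like this, and I need to cut out what is inside the href attribute and what is inside the <a> tag. Basically, I want to end up with a list that looks like: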

[["needs to be cut out", "Foo to BAR"], ["this also needs to be cut out", "BAR to Foo"]]

The href attributes contain a lot of special characters, for example:

<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">

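As I see it, using an HTML parser is too much trouble when I don't need to walk an object tree and only want a few URLs and words from the page. But I can't really work out how to form the regular expression. The one I came up with seems completely wrong, so I'm asking whether anyone can help me.

Source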

2012-12-26 Арсений Пичугин
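For reference, the kind of pattern the question asks about could be sketched roughly as below. It assumes every fragment is a single flat <a> tag with a double-quoted href and no nested markup; anything less regular than the samples in the question will likely break it. The names LINK_RE and extract_pairs are made up for illustration and are not from the original post.

```python
import re

# One flat <a> tag: group 1 is the double-quoted href value, group 2 the link text.
# Assumption: no nested tags inside the link and no single-quoted attributes.
LINK_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def extract_pairs(fragments):
    """Return [href, text] for every <a>...</a> found in the given strings."""
    pairs = []
    for fragment in fragments:
        for href, text in LINK_RE.findall(fragment):
            pairs.append([href, text])
    return pairs


fragments = ['<a href="needs to be cut out">Foo to BAR</a>',
             '<a href="this also needs to be cut out">BAR to Foo</a>']
pairs = extract_pairs(fragments)
```

Note that entities such as &amp; inside the href are left exactly as written; the pattern only slices the string and does not decode anything.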

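Just use an HTML parser anyway. Python ships with several, and the xml.etree.ElementTree API is easier to get working than a regular expression, even for simple <a> tags with arbitrary attributes: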

from xml.etree import ElementTree as ET 

texts = [] 
for linktext in linkslist: 
    link = ET.fromstring(linktext) 
    texts.append([link.attrib['href'], link.text])

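If you use ' '.join(link.itertext()) you get the text out of everything nested under the <a> tag, which helps if you find that some links mark up their text further with nested <span>, <b>, <i> or other inline tags: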

for linktext in linkslist: 
    link = ET.fromstring(linktext) 
    texts.append([link.attrib['href'], ' '.join(link.itertext())])

This gives:

>>> from xml.etree import ElementTree as ET 
>>> linkslist = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']  
>>> texts = [] 
>>> for linktext in linkslist: 
...  link = ET.fromstring(linktext) 
...  texts.append([link.attrib['href'], ' '.join(link.itertext())]) 
... 
>>> texts 
[['needs to be cut out', 'Foo to BAR'], ['this also needs to be cut out', 'BAR to Foo']]
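One caveat: ET.fromstring expects well-formed XML, so a fragment such as the unclosed <a href="?a=p.stops&amp;..."> from the question raises a ParseError. A possible fallback, sketched here with the standard library's more forgiving HTMLParser, might look like the following. LinkCollector and extract_links are illustrative names, not part of any library, and the try/except import is only there so the sketch runs on both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class LinkCollector(HTMLParser):
    """Collects an [href, text] pair for every <a> tag it is fed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []       # [href, text] pairs, in document order
        self._current = None  # the pair being filled while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) tuples; the parser has
            # already decoded entities such as &amp; in the values.
            self._current = [dict(attrs).get('href'), '']
            self.links.append(self._current)

    def handle_data(self, data):
        # Text inside nested inline tags also arrives here, so it is kept.
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current = None


def extract_links(fragments):
    """Parse each fragment with a fresh parser and gather all pairs."""
    links = []
    for fragment in fragments:
        collector = LinkCollector()
        collector.feed(fragment)
        collector.close()
        links.extend(collector.links)
    return links
```

Calling extract_links(linkslist) on the list above should give the same pairs as the ElementTree version, and an unclosed <a> simply yields its href with an empty text instead of an exception.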

Source

2012-12-26 19:44:12

You can use BeautifulSoup to parse the HTML fragments.

According to your question, you already have the following list:

l = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

Now all you need is the following code.

from BeautifulSoup import BeautifulSoup 

parsed_list = [] 

for each in l: 
    soup = BeautifulSoup(each) 
    parsed_list.append([soup.find('a')['href'], soup.find('a').contents[0]])

Hope it helps :)

Source

2012-12-27 05:14:20 Somesh

I would use EHP, a simple HTML parser, for this.

Check out https://github.com/iogf/ehp

from ehp import Html  # assumed import; the original snippet omitted it

lst = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>', '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">'] 

data = [(tag.text(), attr.get('href'))for indi in lst 
      for tag, name, attr in Html().feed(indi).walk() if attr.get('href')] 


data

Output:

[('Foo to BAR', 'needs to be cut out'), ('BAR to Foo', 'this also needs to be cut out'), ('', u'?a=p.stops&direction_id=23600&interval=1&t=wml&l=en')]

Source

2016-03-20 10:17:31
