2012-12-26 38 views
0

Parsing HTML with regular expressions in Python 2.7: I really don't get it

Sorry if this is a dumb question, but I really need help with Python.

['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>'] 

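So I have a list like this, and I need to cut out what is inside the href attribute and what is inside the <a> tag. Basically, I want to end up with a list that looks like: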

[["needs to be cut out", "Foo to BAR"], ["this also needs to be cut out", "BAR to Foo"]] 

The href attributes contain a lot of special characters, for example:

<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en"> 

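As I see it, using an HTML parser is too much trouble when I don't really need to walk an object tree and only need a few URLs and words from the page. But I can't really figure out how to form the regular expression. The ones I have written seem completely wrong, so I'm asking whether someone can help me.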

Answers

1

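Just use an HTML parser anyway. Python ships with a few, and the xml.etree.ElementTree API (strictly an XML parser, which is enough for well-formed snippets like these) is easier to get working than a regular expression, even for simple <a> tags with arbitrary attributes: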

from xml.etree import ElementTree as ET 

texts = [] 
for linktext in linkslist: 
    link = ET.fromstring(linktext) 
    texts.append([link.attrib['href'], link.text]) 
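The question mentions href values full of &amp; entities. As a minimal sketch of how ElementTree treats them: the href below is copied from the question, while the closing </a> and the link text "stops" are assumptions added for illustration, because ElementTree only accepts well-formed XML.

```python
from xml.etree import ElementTree as ET

# href copied from the question; the closing </a> and the text "stops"
# are added here for illustration (assumption, not from the original post).
snippet = ('<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1'
           '&amp;t=wml&amp;l=en">stops</a>')

link = ET.fromstring(snippet)
href = link.attrib['href']  # entities are decoded: &amp; becomes &
print(href)

# The tag exactly as posted in the question is never closed,
# so the XML parser rejects it with a ParseError.
unclosed = '<a href="?a=p.stops&amp;direction_id=23600">'
try:
    ET.fromstring(unclosed)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False
print(parsed_ok)
```

So the entity decoding comes for free, but input that is not well-formed needs a more lenient parser.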

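If you use ' '.join(link.itertext()) you can get the text out of anything nested under the <a> tag, in case some links turn out to have nested <span>, <b>, <i> or other inline tags that mark up the link text further: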

for linktext in linkslist: 
    link = ET.fromstring(linktext) 
    texts.append([link.attrib['href'], ' '.join(link.itertext())]) 

This gives:

>>> from xml.etree import ElementTree as ET 
>>> linkslist = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']  
>>> texts = [] 
>>> for linktext in linkslist: 
...  link = ET.fromstring(linktext) 
...  texts.append([link.attrib['href'], ' '.join(link.itertext())]) 
... 
>>> texts 
[['needs to be cut out', 'Foo to BAR'], ['this also needs to be cut out', 'BAR to Foo']] 
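One detail may be worth checking when link text really does contain inline tags. The sketch below uses a hypothetical link with a nested <b> tag (not taken from the question) to compare joining the itertext() pieces with a space against joining them with an empty string.

```python
from xml.etree import ElementTree as ET

# Hypothetical link whose text is partly wrapped in an inline tag.
link = ET.fromstring('<a href="x">Foo <b>to</b> BAR</a>')

pieces = list(link.itertext())      # text fragments in document order
spaced = ' '.join(pieces)           # adds a space between every fragment
exact = ''.join(pieces)             # keeps the whitespace as written
normalized = ' '.join(spaced.split())  # collapse runs of whitespace

print(pieces)
print(repr(spaced))
print(repr(exact))
```

Because the fragments already carry their own spaces, joining with ' ' doubles them up here; joining with '' or normalizing the whitespace afterwards avoids that.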
1

You can use BeautifulSoup, which also takes care of the HTML entities.

According to your question, you already have the following list:

l = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>'] 

Now all you need is the following code.

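# BeautifulSoup 4 import; with the older BeautifulSoup 3 package this was
# `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup 

parsed_list = [] 

for each in l: 
    # Parse one snippet and look up its <a> tag once
    link = BeautifulSoup(each, 'html.parser').find('a') 
    parsed_list.append([link['href'], link.contents[0]]) 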

Hope it helps :)

0

I would use EHP, a simple HTML parser, for this.

Check out https://github.com/iogf/ehp

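from ehp import Html  # assumes the ehp package from the repository above is installed

lst = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>', '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">'] 

# Walk every tag in each snippet and keep (link text, href) for tags that carry an href
data = [(tag.text(), attr.get('href')) for indi in lst 
        for tag, name, attr in Html().feed(indi).walk() if attr.get('href')] 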


data 

Output:

[('Foo to BAR', 'needs to be cut out'), ('BAR to Foo', 'this also needs to be cut out'), ('', u'?a=p.stops&direction_id=23600&interval=1&t=wml&l=en')]
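For readers without EHP installed, roughly the same result can be sketched with the standard library alone. This is not EHP's API: it swaps in html.parser from Python 3 (on Python 2.7 the module is named HTMLParser), and the names LinkCollector and extract_links are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser  # Python 3; on Python 2.7: from HTMLParser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (text, href) pairs for <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a':
            self._flush()

    def close(self):
        HTMLParser.close(self)
        self._flush()  # also record a link whose <a> was never closed

    def _flush(self):
        if self._href is not None:
            self.links.append((''.join(self._text), self._href))
            self._href = None


def extract_links(snippets):
    """Return (text, href) tuples for every <a href=...> in the snippets."""
    data = []
    for snippet in snippets:
        collector = LinkCollector()
        collector.feed(snippet)
        collector.close()
        data.extend(collector.links)
    return data


lst = ['<a href="needs to be cut out">Foo to BAR</a>',
       '<a href="this also needs to be cut out">BAR to Foo</a>',
       '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">']
print(extract_links(lst))
```

Like the EHP version, this tolerates the unclosed third tag and returns the pairs in (text, href) order; swap the tuple elements if the href should come first, as in the question.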