Complex regex to extract author names in Python

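I am trying, rather unsuccessfully, to write a regular expression that grabs the content of any HTML element whose class is one of (author | byline | writer).

Here is what I have so far: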

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

What I need to match:

<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>

or

<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

Any help would be greatly appreciated. -Stefan

Source

2011-07-04 Stefan Harris

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – MitMaro

Don't do this. See MitMaro's link. Imagine something like 'hello world another block' spread across nested elements. It can't be done. HTML is not a regular language. Use a proper parser.

Could you post some sample input and the expected output? – Stephan

Try this:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

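What I added:
- .*? after the tag name, in case the class attribute does not come immediately after the opening tag.
- *? after [^>], to make the * operator non-greedy when looking for the closing >.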
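
To see what this pattern actually captures, here is a small check against shortened versions of the two samples from the question. The `re.IGNORECASE` and `re.DOTALL` flags, the `href="#"` placeholders, and the variable names are assumptions for illustration; the answer does not say which flags to use, and without `re.IGNORECASE` the uppercase-only `[A-Z]` class would not match lowercase tag names at all.

```python
import re

# Stephan's pattern, compiled with flags that are assumed here.
pattern = re.compile(
    r'<([A-Z][A-Z0-9]*).*?class="(byLineTag|byline|author|by)"[^>]*?>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)

# Shortened stand-ins for the two samples in the question.
sample1 = '<h6 class="byline">By <a rel="author" href="#">JACK EWING</a></h6>'
sample2 = ('<div class="noindex"><span class="by">By </span>'
           '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>')

m1 = pattern.search(sample1)
print(m1.group(1), m1.group(2))   # tag name and matched class
print(m1.group(3))                # inner HTML, tags included

# In the second sample the lazy `.*?` runs past the end of the <div> tag,
# so the match is anchored on the outer <div>, not on the <span>.
m2 = pattern.search(sample2)
print(m2.group(1), m2.group(2))
```

On the second sample the captured tag is the outer div and the captured content still contains markup, which is consistent with the pattern appearing to work for the first example but not the second.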

Source

2011-07-04 22:04:42 Stephan

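Thanks for the quick response. This works for my first example, but not for the second.

I added a small enhancement at the end of the regex; give it a try. – Stephan

You forgot to account for the space between the tag name and the first attribute name. Also, unless you are sure that class is always the first attribute, your expression should allow for other attributes before it. If you really care about the closing tag, \1 is the right backreference: capturing groups are numbered from 1, and group 0 is the whole match. As I pointed out in the comments, you should also include lowercase characters in the character class.

Here is a better expression (I have ignored the closing tag to keep it simpler):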
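
Group numbering is easy to verify in an interpreter. The toy pattern and input string below are illustrative assumptions, not taken from the thread; in Python's `re` module, group 0 is the whole match and capturing groups are numbered from 1.

```python
import re

# \1 in the pattern refers back to whatever the first group captured.
m = re.search(r'<([a-z][a-z0-9]*)>(.*?)</\1>', '<h6>By JACK EWING</h6>')

print(m.group(0))  # the entire match
print(m.group(1))  # the first capturing group (the tag name)
print(m.group(2))  # the second capturing group (the content)
```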

<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>

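Remember to join all the lines together first, to avoid errors when a tag is split across several lines. Of course, if you use Python's HTML parser, you will probably save yourself a lot of work.

Source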
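
As a rough sketch of the parser route, the standard library's `html.parser.HTMLParser` can collect the text inside matching elements. The class name `BylineParser`, its attributes, and the shortened sample are hypothetical, and the depth counter assumes every opened tag inside a byline is closed (a bare `<br>` would throw it off).

```python
from html.parser import HTMLParser

BYLINE_CLASSES = {"byLineTag", "byline", "author", "by"}


class BylineParser(HTMLParser):
    """Collect the text inside every element that has a byline class."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth inside a matching element
        self.parts = []      # text chunks of the element being collected
        self.bylines = []    # finished results, one string per element

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        # Start counting at a matching element; keep counting its children.
        if self.depth or classes & BYLINE_CLASSES:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.bylines.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


sample = ('<div class="noindex"><span class="by">By </span>'
          '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>')
parser = BylineParser()
parser.feed(sample)
parser.close()
print(parser.bylines)
```

Unlike the regex, this does not care about attribute order, quoting style, or tags split across lines.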

2011-07-04 22:29:12 jforberg

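Thanks, but this does not capture the tag's content.

HTMLParser is your friend. – jforberg

Regular expressions are not particularly well suited to parsing HTML. Thankfully, there are tools built specifically for parsing HTML, such as BeautifulSoup and lxml; the latter is demonstrated below: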

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>''' 

import lxml.html

# Parse the markup and select elements by class with a CSS selector
doc = lxml.html.fromstring(markup)
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())
# By JACK EWING and LANDON THOMAS Jr. 
# By 
# Sarah Shemkus
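
The output above is still byline text rather than individual names. Once the markup is gone, a regex is a reasonable tool for the plain-text step. The helper below is a hypothetical sketch (the name `split_byline` and its splitting rules are assumptions) that handles a leading "By" and names joined by commas or "and".

```python
import re


def split_byline(text):
    """Split byline text such as 'By A and B' into a list of names."""
    # Drop a leading "By" (any case) and the whitespace around it.
    body = re.sub(r'^\s*by\b\s*', '', text, flags=re.IGNORECASE)
    # Split on commas and on the word "and"; discard empty pieces.
    parts = re.split(r'\s*,\s*|\s+and\s+', body)
    return [p.strip() for p in parts if p.strip()]


print(split_byline('By JACK EWING and LANDON THOMAS Jr.'))
print(split_byline('By '))
print(split_byline('Sarah Shemkus'))
```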

Source

2011-07-04 22:29:32 bernie

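+1 for the alternative approach using CSS selectors. I must have missed .cssselect().

Parsing HTML with regular expressions is strongly discouraged, for the reasons already mentioned. Use an existing HTML parser. As a simple example, I have included one that uses lxml and its CSS selectors.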

from lxml import etree 
from lxml.cssselect import CSSSelector 

## Your html string 
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>''' 

## lxml html parser 
html = etree.HTML(html_string) 

## lxml CSS selector 
sel = CSSSelector('.author, .byline, .writer') 

## Call the selector to get matches 
matching_elements = sel(html) 

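for elem in matching_elements:
    # elem.text holds only the text before the first child element,
    # so join itertext() to get the full text of the byline
    print(''.join(elem.itertext()))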
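
One detail worth checking when printing matches: an element's `.text` is only the text that precedes its first child, while `itertext()` walks the element and all of its descendants. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in (an assumption made so it runs without lxml; lxml's etree elements expose the same `.text` and `itertext()` interface) on a shortened, well-formed version of the first sample.

```python
import xml.etree.ElementTree as ET

# A well-formed, shortened version of the first sample.
h6 = ET.fromstring(
    '<h6 class="byline">By <a rel="author">JACK EWING</a> and '
    '<a rel="author">LANDON THOMAS Jr.</a></h6>'
)

first_text = h6.text                  # only the text before the first <a>
full_text = ''.join(h6.itertext())    # text of the element and its children

print(repr(first_text))
print(full_text)
```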

Source

2011-07-04 22:30:34
