Regex to extract a tag and its content

Consider this:

input = """Yesterday<person>Peter</person>drove to<location>New York</location>"""

How can I use a regex pattern to extract:

person: Peter 
location: New York

This works well, but I don't want to hard-code the tags, because they can change:

import re
print(re.findall(r"<person>(.*?)</person>", input))
print(re.findall(r"<location>(.*?)</location>", input))


2014-03-24 DevEx

You're getting dangerously close to http://stackoverflow.com/a/1732454/3001761 – jonrsharpe

@DevEx please see the edited answer – PyNEwbie

Use a tool designed for the job. I happen to like lxml, but there are others.

>>> minput = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> from lxml import html
>>> tree = html.fromstring(minput)
>>> for e in tree.iter():
...     print(e, e.tag, e.text_content())
...     if e.tag == 'person':   # tag is an attribute, not a method; get the last name per the comment
...         last = e.text_content().split()[-1]
...         print(last)


<Element p at 0x3118ca8> p YesterdayPeter Smithdrove toNew York 
<Element person at 0x3118b48> person Peter Smith 
Smith           # here is the last name 
<Element location at 0x3118ba0> location New York

If you are new to Python, you may want to visit this site to get an installer that bundles a large number of packages, including lxml.
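Since the question asks for tag names that are not hard-coded, here is a minimal sketch of the same loop collecting whatever tags show up. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so it assumes the fragment is well-formed XML once wrapped in a dummy root element. The function name extract_tags is made up for illustration.

```python
import xml.etree.ElementTree as ET

def extract_tags(text):
    """Return (tag, text) pairs for every top-level element in `text`.

    Assumption: `text` becomes well-formed XML once wrapped in a root element.
    """
    # Wrap the fragment in a dummy root so it parses as a single XML document.
    root = ET.fromstring("<root>" + text + "</root>")
    # Iterating an element yields its direct children in document order.
    return [(child.tag, child.text) for child in root]

minput = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"
for tag, content in extract_tags(minput):
    print("%s: %s" % (tag, content))
```

ElementTree is strict, so real-world HTML with unclosed tags would raise a ParseError here; a lenient HTML parser such as lxml.html is the safer choice in that case.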


2014-03-24 19:59:48 PyNEwbie

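+1 Time wasted fighting regex: 0 seconds – slezica

Thanks @PyNEwbie. In the case of 'Peter Smith', how can I use 'text_content()' to extract only 'Smith'? – DevEx

You can't, but you can split the string once you have it. – PyNEwbie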

Avoid parsing HTML with regex; use an HTML parser instead.

Here's an example using BeautifulSoup:

from bs4 import BeautifulSoup

data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(data, "html.parser")

print('person: %s' % soup.person.text)
print('location: %s' % soup.location.text)

Prints:

person: Peter 
location: New York

Note how simple the code is.
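If the tag names are not known in advance, a parser can still report them as it meets them. Below is a rough sketch using only the standard library's html.parser (the parser BeautifulSoup can be told to use). The class name TagCollector is made up for illustration, and the sketch assumes flat, non-nested tags like the ones in the question.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect (tag, text) pairs; assumes tags are flat, not nested."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None   # name of the tag currently open, if any
        self._buf = []      # text seen since that tag opened

    def handle_starttag(self, tag, attrs):
        self._open = tag
        self._buf = []

    def handle_data(self, data):
        # Ignore text outside any tag, such as "Yesterday" or "drove to".
        if self._open is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.pairs.append((tag, "".join(self._buf)))
            self._open = None

data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
collector = TagCollector()
collector.feed(data)
collector.close()
for tag, text in collector.pairs:
    print("%s: %s" % (tag, text))
```

Nested markup would need a stack of open tags rather than the single `_open` slot used here.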

Hope that helps.


2014-03-24 20:01:41 alecxe

Now, this is better :) – Jerry
