Find a string inside a paragraph found with a regular expression

I have an HTML page in which some lines look like this:

<div> 
    <p class="match"> this sentence should match </p> 
    some text 
    <a class="a"> some text </a> 
</div> 
<div> 
    <p class="match"> this sentence shouldnt match</p> 
    some text 
    <a class ="b"> some text </a> 
</div>

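I want to extract the sentence inside <p class="match">, but only when the same div also contains <a class="a">.

What I have done so far is below (I first find the blocks that contain <a class="a">, then iterate over the results to find the sentence inside <p class="match">):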

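import re

# Read the whole HTML file ("a" is the file name used in the question).
with open("a") as file_to_r:
    html = file_to_r.read()

# Step 1: match a <div> ... </div> span that contains "a".
regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)

# Step 2: pull the sentence out of <p class="match"> inside each span.
regex_match = re.compile(r'<p class="match">(.+)</p>')
for m in regex_div.findall(html):
    print(regex_match.findall(m))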

But I would like to know whether there is another (still efficient) way to do this in a single pass?

Source

2014-08-28 Simon

Try Beautiful Soup 4 to parse the HTML file.. – 2014-08-28 17:04:48

http://stackoverflow.com/a/1732454 – carloabelli 2014-08-28 17:04:54

You should use an HTML parser, but if you still want a regular expression, you can use something like this:

<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
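As a rough sketch of how the pattern above might be applied from Python: this assumes the `re.DOTALL` flag (the answer does not state which flags were used) so that `.*?` can cross the newline before `</div>`, and it reuses the sample HTML from the question. The names `html`, `pattern` and `sentences` are illustrative only.

```python
import re

# Sample HTML copied from the question.
html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# The answer's pattern; re.DOTALL is an assumption so that ".*?" may span lines.
pattern = re.compile(
    r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>',
    re.DOTALL,
)

# findall returns the captured group for each match; strip the padding spaces.
sentences = [s.strip() for s in pattern.findall(html)]
print(sentences)
```

Note that `[\w\s]+` only accepts word characters and whitespace, so a sentence containing punctuation (for example an apostrophe) would not be captured by this pattern.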


Source

2014-08-28 17:20:38

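Not really reliable ... (http://regex101.com/r/pV9qY8/2). – Jerry 2014-08-28 17:22:29

@Jerry, as I suggested in my answer, I would not use a regular expression to parse HTML. But I posted the answer as an option that addresses the question using a regex. – 2014-08-28 17:30:30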

Use an HTML parser, such as BeautifulSoup.

Find the a tag with class a, then find its previous sibling: the p tag with class match:

from bs4 import BeautifulSoup 

data = """ 
<div> 
    <p class="match"> this sentence should match </p> 
    some text 
    <a class="a"> some text </a> 
</div> 
<div> 
    <p class="match"> this sentence shouldn't match</p> 
    some text 
    <a class ="b"> some text </a> 
</div> 
""" 

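# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(data, "html.parser")
a = soup.find('a', class_='a')
# print() is a function in Python 3; strip() drops the padding spaces.
print(a.find_previous_sibling('p', class_='match').text.strip())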

Prints:

this sentence should match
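The snippet above looks at the first `<a class="a">` only. If every qualifying div should be handled, or BeautifulSoup cannot be installed, the same idea can be sketched with the standard library's `html.parser` instead (a swapped-in parser, not the BeautifulSoup API). The class name `MatchCollector` and its attributes are hypothetical, and the sketch assumes the divs are not nested.

```python
from html.parser import HTMLParser


class MatchCollector(HTMLParser):
    """Collect <p class="match"> text from each <div> that also has <a class="a">."""

    def __init__(self):
        super().__init__()
        self.results = []        # sentences from qualifying divs
        self._pending = []       # sentences seen in the current div
        self._has_link = False   # True once <a class="a"> is seen in the current div
        self._in_match = False   # True while inside <p class="match">
        self._buffer = []        # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Assumption: divs are not nested, so a new div resets the state.
            self._pending, self._has_link = [], False
        elif tag == "p" and "match" in classes:
            self._in_match, self._buffer = True, []
        elif tag == "a" and "a" in classes:
            self._has_link = True

    def handle_data(self, data):
        if self._in_match:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_match:
            self._pending.append("".join(self._buffer).strip())
            self._in_match = False
        elif tag == "div":
            # Keep the sentences only if the div contained <a class="a">.
            if self._has_link:
                self.results.extend(self._pending)
            self._pending, self._has_link = [], False


data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

collector = MatchCollector()
collector.feed(data)
collector.close()
print(collector.results)
```

Because the decision is made when `</div>` is reached, the order of the `<p>` and `<a>` tags inside a div does not matter here, unlike the previous-sibling lookup.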

Also, see here why you should avoid using regular expressions to parse HTML:

RegEx match open tags except XHTML self-contained tags

Source

2014-08-28 17:06:14 alecxe

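@user3683807 please read the linked topic carefully - HTML parsers are made specifically for parsing HTML - a specific tool for a specific task. I suggest using `BeautifulSoup` here - it makes HTML parsing easy and reliable. – alecxe 2014-08-28 17:19:14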

<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))

You can use this.

See the demo.

http://regex101.com/r/lK9iD2/7
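A small sketch of how this pattern might be used from Python, assuming no extra flags. It also tries the same markup collapsed onto one line, since the pattern requires literal newlines between the tags. The variable names (`multi_line`, `one_line`, `extract`) are illustrative and not part of the answer.

```python
import re

# The answer's pattern, unchanged; group 1 is the paragraph text.
pattern = re.compile(
    r'<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))'
)

# Sample HTML laid out as in the question.
multi_line = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# The same first div written on a single line (no newlines at all).
one_line = (
    '<div><p class="match"> this sentence should match </p> '
    'some text <a class="a"> some text </a></div>'
)


def extract(html):
    """Return the stripped text of group 1 for every match of the pattern."""
    return [m.group(1).strip() for m in pattern.finditer(html)]


print(extract(multi_line))
print(extract(one_line))
```

The second call finds nothing because the pattern depends on the line breaks in the source, which is one reason the other answers recommend an HTML parser.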

Source

2014-08-28 17:34:45 vks
