Python regular expression to extract the content inside a tag from HTML files


I have many HTML files in a folder. I need to check whether they contain this tag:

<strong>QQ</strong>

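and extract only "QQ" and its content. I started by reading one of the files as a test, but my regular expression does not seem to match. If I replace fo_read with the tag itself as a string,

<strong>QQ</strong>

then it does match.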

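import re

# Read the whole file under test; the with-block closes it automatically
with open('4251-fu.html', "r") as fo:
    fo_read = fo.read()

# Literal pattern: matches only when "QQ" directly touches both tags
m = re.search('<strong>(QQ)</strong>', fo_read)
if m:
    print('Match found: ' + m.group(1))
else:
    print('No match')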

Source

2017-05-28 Michael Lin

Have you considered using an HTML parser instead? [Parsing HTML with regular expressions is a terrible idea](https://stackoverflow.com/a/1732454/5067311).

I have BeautifulSoup, but there are several strong tags in the HTML. How would it work in that case?

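Having multiple tags is one more reason to use an HTML parser. I'm not familiar with the topic, but the BS4 documentation or the [standard html module](https://docs.python.org/3/library/html.parser.html) documentation (oops: [python2 for you](https://docs.python.org/2/library/htmlparser.html)) and some targeted googling should help.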
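As a rough illustration of the standard-library route mentioned above, here is a minimal Python 3 sketch built on html.parser. The class name StrongTextCollector, the helper strong_texts, and the sample markup are made up for this example; only HTMLParser and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collect the stripped text of every <strong> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a <strong> element
        self.buffer = []    # text pieces of the element currently being read
        self.texts = []     # stripped text of each finished element

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.buffer).strip())
                self.buffer = []


def strong_texts(html):
    """Return the text of all <strong> elements found in an HTML string."""
    collector = StrongTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


# Made-up sample: the first tag has newlines around its text
sample = "<p><strong>\nQQ\n</strong> rest <strong>Other</strong></p>"
print(strong_texts(sample))
```

Checking for one particular tag then becomes a plain membership test such as "QQ" in strong_texts(html), no matter how many other strong tags the file contains.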

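import re
from bs4 import BeautifulSoup

# Parse the file under test (file name taken from the question)
with open('4251-fu.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Find the first <strong> whose text contains the target phrase
result = soup.find("strong", string=re.compile("Question-and-Answer Session"))
if result:
    print("Question-and-Answer Session")
    # for the rest of text in the parent
    rest = result.parent.text.split("Question-and-Answer Session")[-1].strip()
    print(rest)
else:
    print("no match")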

Source

2017-05-28 01:03:54 Serge

It returns [u'\nQuestion-and-Answer Session\n']. How can I get just Question-and-Answer Session?

You can add a `.strip()` at the end of `result.parent.text.split(...)[-1]`.
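The newlines around the returned text suggest one plausible reason the literal pattern in the question never matched: in the file, the tag text does not touch the tags directly. If a regular expression is still wanted, a small sketch that tolerates the surrounding whitespace could look like this. PATTERN, find_heading, and the sample string are illustrative names, not part of the original posts.

```python
import re

# \s* allows optional whitespace, including newlines, around the tag text
PATTERN = re.compile(r"<strong>\s*(Question-and-Answer Session)\s*</strong>")


def find_heading(html):
    """Return the captured heading text, or None when the tag is absent."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


# Made-up sample that mimics the newlines seen in the output above
sample = "<p><strong>\nQuestion-and-Answer Session\n</strong></p>"
print(find_heading(sample))
```

This only covers whitespace. Attributes on the tag or nested markup would still defeat the pattern, which is the usual argument for a parser.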

The splitting is a bit hacky; for any serious project you could try `next_sibling` instead: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways – Serge
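A minimal sketch of the next_sibling idea, assuming the text of interest directly follows the strong tag inside the same parent. The function name text_after_heading and the sample markup are invented for illustration, and the built-in 'html.parser' backend is used so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def text_after_heading(html, phrase):
    """Return the text of the node right after the matching <strong>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("strong", string=re.compile(re.escape(phrase)))
    if heading is None:
        return None
    following = heading.next_sibling  # may be a text node, a tag, or None
    if following is None:
        return ""
    if isinstance(following, NavigableString):
        return str(following).strip()
    return following.get_text().strip()


# Made-up sample markup
sample = "<p><strong>Question-and-Answer Session</strong> Operator: first question.</p>"
print(text_after_heading(sample, "Question-and-Answer Session"))
```

Unlike the split on the parent text, this looks only at the single node that follows the tag, so longer sections may need a loop over next_siblings.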

You can try it with BeautifulSoup:

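from bs4 import BeautifulSoup

# Parse the file; the 'lxml' parser requires the lxml package to be installed
with open('4251-fu.html', mode='r') as f:
    soup = BeautifulSoup(f, 'lxml')

# Serialize every <strong> element in the document back to a markup string
search_result = [str(e) for e in soup.find_all('strong')]
print(search_result)
if '<strong>Question-and-Answer Session</strong>' in search_result:
    print('Match found')
else:
    print('No match')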

Output:

['<strong>Question-and-Answer Session1</strong>', '<strong>Question-and-Answer Session</strong>', '<strong>Question-and-Answer Session3</strong>'] 
Match found

Source

2017-05-28 00:52:03

There are several strong tags, but I only want the Question-and-Answer Session one.

But the strong tag appears in different places, not always at the beginning.

It will find all the strong tags in the HTML file, wherever they are.
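To keep only the one heading among several strong tags, and to repeat the check for every file in a folder, something along these lines might work. It is a sketch under assumptions: the helper names has_heading and files_with_heading, the "*.html" pattern, and the UTF-8 encoding are all guesses for illustration and do not come from the question.

```python
import glob
import os
from bs4 import BeautifulSoup

TARGET = "Question-and-Answer Session"


def has_heading(html, target=TARGET):
    """True when some <strong> tag's stripped text equals target exactly."""
    soup = BeautifulSoup(html, "html.parser")
    return any(tag.get_text(strip=True) == target
               for tag in soup.find_all("strong"))


def files_with_heading(folder, target=TARGET):
    """Return the sorted .html paths in folder whose content has the heading."""
    matches = []
    for path in sorted(glob.glob(os.path.join(folder, "*.html"))):
        # Assumption: the files are UTF-8 encoded
        with open(path, encoding="utf-8") as f:
            if has_heading(f.read(), target):
                matches.append(path)
    return matches
```

Comparing the stripped text, rather than the serialized markup, means a tag such as "Question-and-Answer Session1" is ignored while a tag with newlines around the exact phrase still counts.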
