Regex: what's wrong with this RegEx?

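First of all, I apologize for the terrible question title, but I couldn't come up with a better one.

I'm trying to build a small tool in Python to improve my skills. It scrapes data from Imdb.com and outputs the titles, filtering everything else out of the HTML.

I'm using this regular expression for my search: <h3 class="findSectionHeader"><a name="tt"><\/a>Titles<\/h3>[\s]{0,3}(.*?)<\/td> <\/tr><\/table> This should capture everything after a>Titles<\/h3> and before <\/tr><\/table>, but I'm doing something wrong. I added [\s]{0,3} because I thought the problem might be a \n or something similar, but that didn't fix it.

Here is the source block: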

<div class="findSection"> 
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3> 
<table class="findList"> 
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" > 
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" /> 
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" > 
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a> 
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>

Source

2017-03-02 Zesa Rex

Don't try to process HTML with regular expressions; use a DOM parser instead. [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) should be a good starting point for Python.
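As a rough illustration of the parser-based approach this comment suggests, here is a minimal sketch. It swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra installs. The class name ResultTitleParser and the trimmed sample_html are made up for this example, and real IMDb markup may differ.

```python
from html.parser import HTMLParser


class ResultTitleParser(HTMLParser):
    """Hypothetical helper: collects link text inside <td class="result_text"> cells."""

    def __init__(self):
        super().__init__()
        self.in_result_cell = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "result_text":
            self.in_result_cell = True
        elif tag == "a" and self.in_result_cell:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_result_cell = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only text inside a link inside a result cell counts as a title.
        if self.in_link and data.strip():
            self.titles.append(data.strip())


# Trimmed, well-formed sample (an assumption; not the full page from the question).
sample_html = """<table class="findList">
<tr class="findResult odd"> <td class="result_text">
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>"""

parser = ResultTitleParser()
parser.feed(sample_html)
print(parser.titles)
```

Because the parser tracks tags rather than raw text, line breaks and extra whitespace in the markup do not affect the result.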

The problem is that your '.*?' doesn't match newlines. If you enable single-line mode 's', it works as expected.

@Rawing Ah, it also works when I use '([\s\S]*?)', which matches any character, whitespace as well as non-whitespace. Thanks!

Try the following regex:
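To make the point of these two comments concrete, here is a small sketch comparing the three variants. The shortened text and the simplified patterns are assumptions for illustration only; they are not the exact markup or expression from the question.

```python
import re

# Shortened stand-in for the page source (hypothetical, not the real markup).
text = "<h3>Titles</h3>\n<table>\n<tr><td>Luther</td></tr></table>"

lazy_dot = r"<h3>Titles</h3>\s*(.*?)</tr></table>"
any_char_class = r"<h3>Titles</h3>\s*([\s\S]*?)</tr></table>"

# By default "." stops at a newline, so the lazy group cannot reach the end tag.
no_flag = re.findall(lazy_dot, text)

# With re.DOTALL (single-line mode, "s") the dot matches newlines as well.
with_dotall = re.findall(lazy_dot, text, flags=re.DOTALL)

# [\s\S] matches whitespace or non-whitespace, i.e. any character, without a flag.
with_class = re.findall(any_char_class, text)

print(no_flag)
print(with_dotall)
print(with_class)
```

The first call finds nothing, while the other two capture the same multi-line block.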

(?s)(?<=<\/h3>\n).*?(?=</tr></table>)

See the regex demo/explanation.

Python

import re

# (?s) turns on DOTALL so "." also matches newlines. The lookbehind and
# lookahead keep the surrounding tags out of the match itself.
regex = r"(?s)(?<=<\/h3>\n).*?(?=</tr></table>)"

# Named test_str rather than str so the built-in str type is not shadowed.
# The lookbehind needs the newline to follow </h3> directly (no trailing space).
test_str = """<div class="findSection">
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
<table class="findList">
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" >
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" />
</a> </td> <td class="result_text">
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" >
<img src="http://ia.media-imdb.com/imagestd class="primary_photo">
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a>
</td> <td class="result_text">
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>"""

matches = re.finditer(regex, test_str)
for match_num, match in enumerate(matches, start=1):
    print("Match {match_num} was found at {start}-{end}: {match}".format(
        match_num=match_num, start=match.start(), end=match.end(), match=match.group()))

Source

2017-03-02 13:49:22 m87

You can add the re.DOTALL flag to your re call to make . match newlines:

src = '''<div class="findSection"> 
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3> 
<table class="findList"> 
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" > 
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" /> 
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" > 
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a> 
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>''' 

expr = r'<h3 class="findSectionHeader"><a name="tt"><\/a>Titles<\/h3>[\s]{0,3}(.*?)<\/td> <\/tr><\/table>' 

import re

# re.DOTALL lets "." match newlines, so the lazy group can span several lines.
print(re.findall(expr, src, re.DOTALL))

Output:

['<table class="findList">\n<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" >\n<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" />\n</a> </td> <td class="result_text"> \n<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" >\n<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> \n<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a>\n</td> <td class="result_text"> \n<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) ']

Source

2017-03-02 13:57:11

Actually, that's what I already tried yesterday: 'result = re.findall(r'REGEX', str(result), flags=re.DOTALL)', but it didn't work. Maybe I made a mistake somewhere.
