Python regex doesn't find a substring, but it should

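I'm trying to parse HTML with BeautifulSoup to extract the page title. Sometimes this doesn't work because the site is badly written (bad end tags, for example). When it fails, I fall back to a manual regex.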

I have the text

<html xmlns="http://www.w3.org/1999/xhtml"\n  xmlns:og="http://ogp.me/ns#"\n  xmlns:fb="https://www.facebook.com/2008/fbml">\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n <title>\n     [email protected] prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n   </title>\n <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />...

I'm trying to grab the value between the <title> and </title> tags. It should be fairly simple, but it doesn't work. Here is my Python code.

result = re.search('\<title\>(.+?)\</title\>', html) 
if result is not None: 
    title = result.group(0)

For whatever reason this doesn't work on this text. Either result.group() returns None, or I get an AttributeError: 'NoneType' object has no attribute 'groups'

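I copied and pasted this text into an online Python regex tester and tried all the options (re.match, re.findall, re.search), and they work there. In my script, however, nothing between these tags is ever found, even when I try other regexes such as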

<title>(.*?)</title>

etc.


2012-06-22 Reily Bourne

You should use the re.DOTALL flag so that . also matches newlines.

result = re.search('\<title\>(.+?)\</title\>', html, re.DOTALL)

As the documentation says:

...without this flag, '.' will match anything except a newline
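To make the difference concrete, here is a minimal sketch. The `html` string is a shortened, hypothetical stand-in for the page in the question, not the original data. It also uses `group(1)` rather than `group(0)`, since `group(0)` is the whole match including the tags.

```python
import re

# Hypothetical sample modelled on the question: the title spans several lines.
html = "<head>\n <title>\n     Prepping questions for the interview.\n   </title>\n</head>"

pattern = r"<title>(.+?)</title>"

# Without re.DOTALL, '.' cannot match '\n', so the search finds nothing.
no_flag = re.search(pattern, html)

# With re.DOTALL, '.' also matches '\n', so the lazy group can span lines.
with_flag = re.search(pattern, html, re.DOTALL)

# group(1) is the text captured between the tags; strip the padding around it.
title = with_flag.group(1).strip() if with_flag else None

print(no_flag)
print(repr(title))
```

Checking the result against `None` before calling `.group()` is what avoids the AttributeError on 'NoneType' mentioned in the question.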


2012-06-22 22:28:27 Junuxx

If you want to grab the text between the <title> and </title> tags, you should use this regex:

pattern = "<title>([^<]+)</title>" 

re.findall(pattern, html_string)
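A short sketch of why this pattern works without any flag: a negated character class such as `[^<]` matches newlines too. The sample strings below are hypothetical. The second one shows a limitation worth knowing about: the pattern finds nothing if the title itself contains a `<`.

```python
import re

pattern = r"<title>([^<]+)</title>"

# Hypothetical multi-line title, similar in shape to the one in the question.
html_string = "<title>\n  Example page title\n</title>"

# [^<] matches any character except '<', including '\n', so no flag is needed.
titles = [t.strip() for t in re.findall(pattern, html_string)]

# Limitation: a '<' inside the title (for example nested markup) prevents a match.
nested = re.findall(pattern, "<title>a <b>bold</b> title</title>")

print(titles)
print(nested)
```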


2012-06-22 22:28:12 user278064

Why the 're.DOTALL' flag? You're not even using '.'. – ohaal

@ohaal: Right! Thanks a lot. – user278064
