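Newbie Python regex question: pulling dates from a web page

I'm looking to use Python to pull dates out of a regular, predictable string of text on a web page. The source code looks like this: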

<br /><strong>Date: 06/12/2010</strong> <br />

It always begins with

<strong>Date:

and ends with

</strong>

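I've already scraped the text of the web page and just want to extract the date and similarly structured information. Any suggestions on how to do this? (Sorry this is such a newbie question!)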

Source

2010-12-16 Paul Bradshaw

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – katrielalex 2010-12-16 16:15:30

import re

text = "<br /><strong>Date: 06/12/2010</strong> <br />"
# The non-greedy .*? stops at the first </strong> after "Date:"
m = re.search(r"<strong>(Date:.*?)</strong>", text)
print(m.group(1))

Output

Date: 06/12/2010

Source

2010-12-16 16:11:31 Rod

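Yet another one bitten by greediness... With a greedy '.*' this gives you one very large group, covering everything from the first 'Date:' to the last '</strong>'. – delnan 2010-12-16 16:13:23

Fix: replace '.*' with '.*?' – nmichaels 2010-12-16 16:15:28

@delnan True. I stand corrected! – Rod 2010-12-16 16:16:05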
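
To make the greediness point in the comments above concrete, here is a minimal sketch. The two-date fragment is a made-up example (the question only shows a single date), chosen because the difference only appears when the page contains more than one '</strong>'.

```python
import re

# Hypothetical fragment with two Date fields (not taken from the asker's page).
html = (
    "<br /><strong>Date: 06/12/2010</strong> <br />"
    "<br /><strong>Date: 07/12/2010</strong> <br />"
)

# Greedy: .* runs on to the LAST </strong>, swallowing the markup in between.
greedy = re.search(r"<strong>(Date:.*)</strong>", html).group(1)

# Non-greedy: .*? stops at the FIRST </strong> that follows each match start.
lazy = re.findall(r"<strong>(Date:.*?)</strong>", html)

print(greedy)
print(lazy)
```

With a single date on the page both patterns happen to return the same text, which is why the greedy version can look correct in a quick test.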

You can use a regular expression:

import re 
pattern = re.compile(r'<strong>Date:(?P<date>.*?)</strong>') # re.MULTILINE? 
# Then use it with 
pattern.findall(text) # Returns all matches 
# or 
match = pattern.search(text) # grabs the first match 
match.groupdict() # gives a dictionary with key 'date' 
# or 
match.groups()[0] # gives you just the text of the match.

Or try parsing the thing with Beautiful Soup.

This is a good place for testing Python regular expressions.
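
As a rough illustration of the parser-based route mentioned above: the sketch below does not use Beautiful Soup itself but swaps in Python 3's standard-library html.parser, so that it runs without installing anything. The class name StrongTextParser and its attributes are invented for this example; Beautiful Soup would let you express the same idea in fewer lines.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collects the text found inside every <strong> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # greater than 0 while inside a <strong> element
        self.texts = []   # one string per <strong> element seen

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


parser = StrongTextParser()
parser.feed("<br /><strong>Date: 06/12/2010</strong> <br />")
parser.close()

# Keep only the "Date:" fields and drop the label and surrounding whitespace.
dates = [t.split(":", 1)[1].strip() for t in parser.texts if t.startswith("Date:")]
print(dates)
```

The trade-off is more code than a one-line regex, in exchange for not depending on the exact spacing or attribute layout of the tags.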

Source

2010-12-16 16:11:49 nmichaels

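Can someone explain the '?P<date>' part? – Pete 2010-12-16 16:24:29

It gives the group a name (date). It's not strictly necessary; you can leave out the '?P<date>', but then 'match.groupdict()' won't have a 'date' entry. Look for '?P<' at http://docs.python.org/library/re.html – nmichaels 2010-12-16 16:44:20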
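
A small side-by-side sketch of what the name buys you, using the sample line from the question. The '\s*' after 'Date:' is an addition for this example (it is not in the answer's pattern) so that the captured text has no leading space.

```python
import re

text = "<br /><strong>Date: 06/12/2010</strong> <br />"

# Named group: the captured text can be looked up by name.
named = re.search(r"<strong>Date:\s*(?P<date>.*?)</strong>", text)

# Same pattern without ?P<date>: the group exists but only has a number.
unnamed = re.search(r"<strong>Date:\s*(.*?)</strong>", text)

print(named.groupdict())    # dictionary keyed by group name
print(named.group("date"))  # lookup by name
print(unnamed.groupdict())  # empty: there are no named groups
print(unnamed.group(1))     # lookup by position still works
```

Names tend to pay off once a pattern has several groups, since 'match.group("date")' stays correct even if the groups are later reordered.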
