2011-08-23 43 views

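I am trying to grab a block of text from a file that sits between two lines. Specifically, I need to grab one line and every line after it, up to another specific line. In other words: how do I return all lines from a matching date and time onward, up to but not including the next date and time?

For example, the original file would contain something like this: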

Aug 23, 2011 10:31:35 AM This is the start of the text. 
    This is more Text. 
This is another line 
This is another line 
    This is more. 
Aug 23, 2011 10:41:00 AM This is the next in the series. 
This is another line 
    This is more Text. 
This is another line 
    This is another line 
    This is more. 
Aug 24, 2011 10:41:00 AM This is the next in the series. 
This is another line 
    This is more Text. 
This is another line 
    This is another line 
    This is more. 

and I need it to be parsed so that this is returned:

Aug 23, 2011 10:31:35 AM This is the start of the text. 
    This is more Text. 
This is another line 
This is another line 
    This is more. 

Does anyone have a suggestion for how to accomplish this?

A simple readline loop that stops at the line containing 'the next in the series' – kusut

Edit your question to show what you have tried, and people will be happy to help you solve the problem. –

Go read the [Regular Expression HOWTO](http://docs.python.org/howto/regex.html); it is what you need – prince

Answers
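
The line-by-line loop suggested in the comments can be sketched roughly as follows. This is only an illustration, not code from the thread: the function name `block_after` and the exact timestamp pattern are assumptions, and the loop stops at the next line that starts with a timestamp rather than at a fixed phrase.

```python
import re

# Assumed timestamp format, e.g. "Aug 23, 2011 10:31:35 AM" at the start of a line.
TIMESTAMP = re.compile(
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) '
    r'\d\d?, \d{4} \d\d?:\d\d:\d\d (?:AM|PM)'
)

def block_after(lines, wanted):
    """Return the line starting with `wanted` plus every following line,
    stopping before the next line that starts with a timestamp."""
    block = []
    collecting = False
    for line in lines:
        if collecting and TIMESTAMP.match(line):
            break  # next entry reached; it is not included
        if not collecting and line.startswith(wanted):
            collecting = True
        if collecting:
            block.append(line)
    return block

# Hypothetical usage with a file object (the file name is an assumption):
# with open('log.txt') as f:
#     print(''.join(block_after(f, 'Aug 23, 2011 10:31:35 AM')))
```

Because a file object yields its lines lazily, this approach never has to hold the whole file in memory, which may matter for large logs.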

import re 

s = '''Aug 23, 2011 10:31:35 AM This is the start of the text. 
     This is more Text. 
This is another line 
This is another line 
     This is more. 
Aug 23, 2011 10:41:00 AM This is the next in the series. 
This is another line 
     This is more Text. 
This is another line 
     This is another line 
     This is more. 
Aug 24, 2011 10:41:00 AM This is the next in the series. 
This is another line 
     This is more Text. 
This is another line 
     This is another line 
     This is more. ''' 


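# Alternation of abbreviated month names (non-capturing group)
months = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
# A timestamp such as "Aug 23, 2011 10:31:35 AM"
ch = r'%s \d\d?, \d{4} \d\d:\d\d:\d\d (?:AM|am|PM|pm)' % months

# A timestamp, then as little text as possible, up to the next
# timestamp or the end of the string (neither is consumed)
regx = re.compile(r'%s.*?(?=%s|\Z)' % (ch, ch), re.DOTALL)

for x in regx.findall(s):
    print(repr(x))
    print()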

Result

'Aug 23, 2011 10:31:35 AM This is the start of the text.\n  This is more Text.\nThis is another line\nThis is another line\n  This is more.\n' 

'Aug 23, 2011 10:41:00 AM This is the next in the series.\nThis is another line\n  This is more Text.\nThis is another line\n  This is another line\n  This is more.\n' 

'Aug 24, 2011 10:41:00 AM This is the next in the series.\nThis is another line\n  This is more Text.\nThis is another line\n  This is another line\n  This is more. ' 

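Yes, you do need to learn the regular expression tool (the re module).

Update: a minimal explanation:

Parentheses (...) define a group
Without ?:, it is a capturing group
(?:......) is a non-capturing group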
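
The pattern above returns every entry, while the question asks for the entry belonging to one particular date and time. One possible adaptation is sketched below; the helper name `block_for` is hypothetical and not part of the original answer, and it reuses the same `ch` timestamp pattern.

```python
import re

months = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
ch = r'%s \d\d?, \d{4} \d\d:\d\d:\d\d (?:AM|am|PM|pm)' % months

def block_for(text, stamp):
    """Return the entry that starts with `stamp`, or None if it is absent."""
    # re.escape keeps characters in the stamp from being read as regex syntax.
    pattern = r'%s.*?(?=%s|\Z)' % (re.escape(stamp), ch)
    m = re.search(pattern, text, re.DOTALL)
    return m.group(0) if m else None
```

Calling `block_for(s, 'Aug 23, 2011 10:31:35 AM')` on the sample string would then give only the first entry, stopping just before the 10:41:00 timestamp.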
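
A small sketch of why the distinction matters here: with `re.findall`, a capturing group changes what is returned. The sample string is made up for the illustration.

```python
import re

text = 'Aug 23, Sep 4'

# Capturing group: findall returns only what the group captured.
captured = re.findall(r'(Aug|Sep) \d\d?', text)
print(captured)      # ['Aug', 'Sep']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'(?:Aug|Sep) \d\d?', text)
print(whole)         # ['Aug 23', 'Sep 4']
```

This is why the month alternation in the answer is written with `(?:...)`: otherwise `findall` would return month names instead of whole entries.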

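(?=....) means that, after this point, there must be a portion of the string that matches what follows ?=, but this portion is not consumed by the match. It is the way to make the regex engine stop before this portion without including it. More importantly, it also lets the regex engine start its next match at the beginning of this stopping portion; otherwise that portion would have been consumed as well.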
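
The effect can be seen on a toy string, where `A<digit>` stands in for the timestamp (an assumption made purely to keep the example short):

```python
import re

text = 'A1 x A2 y A3 z'

# Lookahead: each match stops before the next 'A<digit>' without consuming it,
# so the next match can begin there.
with_lookahead = re.findall(r'A\d.*?(?=A\d|\Z)', text)
print(with_lookahead)     # ['A1 x ', 'A2 y ', 'A3 z']

# No lookahead: the terminating 'A<digit>' is consumed as part of the match,
# so it cannot start a match of its own and the 'A2' entry is lost.
without_lookahead = re.findall(r'A\d.*?(?:A\d|\Z)', text)
print(without_lookahead)  # ['A1 x A2', 'A3 z']
```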

re.DOTALL makes the symbol . (dot) match every character, including '\n', which is not the case without this flag.
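
A two-line string (invented for the illustration) shows the difference:

```python
import re

text = 'first line\nsecond line'

# Without the flag, '.' stops at the newline.
without_flag = re.match(r'.*', text).group(0)
print(repr(without_flag))   # 'first line'

# With re.DOTALL, '.' also matches '\n', so the match spans both lines.
with_flag = re.match(r'.*', text, re.DOTALL).group(0)
print(repr(with_flag))      # 'first line\nsecond line'
```

Since each log entry spans several lines, the answer's pattern relies on this flag for `.*?` to cross line boundaries.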

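Regular expressions have a somewhat steep learning curve, but there are many great tutorials (for example http://www.regular-expressions.info/tutorial.html) and they are very powerful. One habit I always keep, though, is to comment heavily any regular expression I use that is even slightly complex; otherwise I come back a few months later and have to decipher what I originally wrote. – Vorticity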
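
One way to follow that advice in Python is the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. The sketch below rewrites the answer's pattern in that style; the names `TIMESTAMP` and `ENTRY` are illustrative, and literal spaces are written as `[ ]` because verbose mode would otherwise drop them.

```python
import re

TIMESTAMP = r'''
    (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)  # abbreviated month
    [ ] \d\d? , [ ] \d{4}                                # day, comma, four-digit year
    [ ] \d\d : \d\d : \d\d                               # hh:mm:ss
    [ ] (?:AM|am|PM|pm)                                  # meridiem marker
'''

ENTRY = re.compile(
    r'''
    %(ts)s        # an entry starts with a timestamp
    .*?           # then as little text as possible
    (?= %(ts)s    # up to (not including) the next timestamp
      | \Z )      # or the end of the string
    ''' % {'ts': TIMESTAMP},
    re.DOTALL | re.VERBOSE,
)
```

`ENTRY.findall(s)` is intended to behave like `regx.findall(s)` in the answer, just with the pattern documented in place.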

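@Vorticity A very apt comment. I had not thought of doing that, and I have resolved to remember this advice in the future. Instead of upvoting your comment, I just upvoted your interesting answer (http://stackoverflow.com/questions/7133977/datetime-print-as-seconds/7134020#7134020), from which I learned about `datetime.timedelta.total_seconds()` – eyquem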
