2014-04-05 52 views
0

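Determining the pattern of lines in Python

I'm new to Python and can't work out how to think about this problem in Python. I have a text file of SMS messages, and I want to capture the multi-line statements.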

import fileinput

parsed = {}

for linenum, line in enumerate(fileinput.input()):
    ### Process the input data ###
    try:
        parsed[linenum] = line
    except (KeyError, TypeError, ValueError):
        value = None
###############################################
### Now have dict with value: "data" pairing ##
### for every text message in the archive #####
###############################################
for item in parsed:
    sent_or_rcvd = parsed[item][:4]
    if sent_or_rcvd != "rcvd" and sent_or_rcvd != "sent" and sent_or_rcvd != '--\n':
        ###########################################
        ### Know we have a second or third line ###
        ###########################################
        pass  # placeholder: this is the part I don't know how to write

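But this is where I hit a wall. I'm not sure what the best way is to collect the strings I get here. I'd love some expert input. I'm using Python 2.7.3, but I'm happy to move to 3.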

Goal: a human-readable file full of the three-line quotes from these SMS messages.

Sample text:

12425234123|2011-03-19 11:03:44|words words words words 
12425234123|2011-03-19 11:04:27|words words words words 
12425234123|2011-03-19 11:05:04|words words words words 
12482904328|2011-03-19 11:13:31|words words words words 
-- 
12482904328|2011-03-19 15:50:48|More bolder than flow 
More cumbersome than pleasure; 
Goodbye rocky dump 
-- 

(Yes, I can tell you all: this is a haiku about poop. I'm trying to capture them from the last 5 years of text messages with my best friend.)

The ideal result would be:

Haipu 3
2011-03-19
More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump

+1

if you give example input and the expected output –

+1

Could you give us a simple example of the input file and what you expect as output? That would be helpful. thx – jrjc

Answers

1
import time

data = """12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump """.splitlines()

def get_haikus(lines):
    haiku = None
    for line in lines:
        line = line.strip()
        if not line or line == '--':
            continue  # skip blank lines and the '--' separators
        try:
            ID, timestamp, txt = line.split('|', 2)
            time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")  # validate the timestamp
            int(ID)  # validate the sender number
        except ValueError:  # raised by the unpacking, strptime() or int(): a continuation line
            if haiku is not None:
                haiku[1].append(line)
            continue
        # a new message starts here; emit the previous one if it had three lines
        if haiku is not None and len(haiku[1]) == 3:
            yield haiku
        haiku = (timestamp, [txt])
    if haiku is not None and len(haiku[1]) == 3:
        yield haiku

# get_haikus() yields tuples of (timestamp, [lines])
for haiku in get_haikus(data):
    timestamp, text = haiku
    date = timestamp.split()[0]
    text = '\n'.join(text)
    print("{d}\n{txt}".format(d=date, txt=text))
+0

Awesome! That did the trick. I'm studying the calls you made to understand what's going on here. I see that 'line.split' makes the parsing easy. Thanks again. – mbb

+0

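One problem: when I load the file with 'open', I get the error "TypeError: 'NoneType' object has no attribute '__getitem__'" at the 'except' clause. Any suggestions for better error handling or parsing here? What am I missing? – mbb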

+0

Got it! It simply wasn't handling random blank lines. I added an 'and not None' to the 'if haiku'. – mbb
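
The blank-line problem raised in the comments above can be avoided by skipping empty lines and `--` separators before any parsing happens. The following is a minimal Python 3 sketch, not the answerer's code: the names `HEADER`, `iter_haikus` and `format_haikus` are made up for illustration, and it assumes that every message starts with a `number|YYYY-MM-DD HH:MM:SS|text` header and that the number after "Haipu" is simply a running counter.

```python
import io
import re

# Assumed header layout: sender number, timestamp, first line of text.
HEADER = re.compile(r'^(\d+)\|(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}\|(.*)$')

def iter_haikus(lines):
    """Yield (date, [line1, line2, line3]) for every three-line message."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == '--':
            continue  # ignore blank lines and separators
        match = HEADER.match(line)
        if match:
            # A new message begins; emit the previous one if it had three lines.
            if current is not None and len(current[1]) == 3:
                yield current
            current = (match.group(2), [match.group(3)])
        elif current is not None:
            current[1].append(line)  # continuation of the current message
    if current is not None and len(current[1]) == 3:
        yield current

def format_haikus(haikus):
    """Render the haikus as 'Haipu N', the date, then the three lines."""
    blocks = []
    for number, (date, text) in enumerate(haikus, start=1):
        blocks.append('Haipu {}\n{}\n{}'.format(number, date, '\n'.join(text)))
    return '\n\n'.join(blocks)

# In-memory stand-in for an opened file, including a stray blank line.
sample = io.StringIO(
    "12482904328|2011-03-19 11:13:31|words words words words \n"
    "-- \n"
    "\n"
    "12482904328|2011-03-19 15:50:48|More bolder than flow \n"
    "More cumbersome than pleasure; \n"
    "Goodbye rocky dump \n"
    "-- \n"
)
report = format_haikus(iter_haikus(sample))
print(report)
```

To run it on a real archive, pass an open file object instead of the `io.StringIO` stand-in, since the generator only needs an iterable of lines.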

1

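A good start might look something like the following. I'm reading the data from a file named data2, but the read_messages generator will consume lines from any iterable.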

#!/usr/bin/env python

def read_messages(file_input):
    message = []
    for line in file_input:
        line = line.strip()
        # a 'rcvd', 'sent' or '--' line marks the boundary between messages
        if line[:4].lower() in ('rcvd', 'sent', '--'):
            if message:
                yield message
                message = []
        else:
            message.append(line)
    if message:
        yield message


with open('data2') as file_input:
    for msg in read_messages(file_input):
        print(msg)

This expects input that looks like the following:

sent 
message sent away 
it has multiple lines 
-- 
rcvd 
message received 
rcvd 
message sent away 
it has multiple lines
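
If the direction of each message matters, the same grouping idea can carry the marker along with the text. This is only a sketch of a possible variant: `read_tagged_messages` is a hypothetical name, and it assumes that `sent`, `rcvd` and `--` each sit alone on their own line, exactly as in the sample input above.

```python
def read_tagged_messages(lines):
    """Yield (direction, body_lines) pairs; direction is 'sent' or 'rcvd'."""
    direction = None
    body = []
    for raw in lines:
        line = raw.strip()
        marker = line.lower()
        if marker in ('sent', 'rcvd', '--'):
            # A marker closes the message collected so far, if there is one.
            if direction is not None and body:
                yield direction, body
            direction = marker if marker != '--' else None
            body = []
        elif line and direction is not None:
            body.append(line)
    if direction is not None and body:
        yield direction, body

sample = """sent
message sent away
it has multiple lines
--
rcvd
message received
rcvd
message sent away
it has multiple lines""".splitlines()

messages = list(read_tagged_messages(sample))
for direction, body in messages:
    print(direction, body)
```

Matching the whole line, instead of only its first four characters, keeps a message line that merely begins with "sent" from being mistaken for a marker.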