2017-05-07 61 views
3

How do I read a text file in which some entries contain newline characters?

I have a text file of this form:

06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde 

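As you can see, the entries are separated by newlines, but some entries also contain newlines within their content. So simply splitting by line does not parse each entry correctly.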

For example, for the 5th entry I want my output to be 07/01/2016, 6:14 pm - abcde fghe

Here is my current code:

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)
+0

Is the data that can contain line breaks itself enclosed in double quotes, by any chance? –

+0

Can you show what the data should look like? Your description is unclear. I see the input, but it is not clear what the result should be. – TitanFighter

+0

I want every element of 'data' to start with a date. – Imran

Answers

1

Given your example input, you can use a regex with a positive lookahead:

import re
from pprint import pprint

fn = 'file.txt'  # assumed path to the input file
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

with open(fn) as f:
    pprint([m.group(1) for m in pat.finditer(f.read())])

Prints:

['06/01/2016, 10:40 pm - abcde\n', 
'07/01/2016, 12:04 pm - abcde\n', 
'07/01/2016, 12:05 pm - abcde\n', 
'07/01/2016, 12:05 pm - abcde\n', 
'07/01/2016, 6:14 pm - abcde\n\nfghe\n', 
'07/01/2016, 6:20 pm - abcde\n', 
'07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n', 
'07/01/2016, 7:58 pm - abcde\n'] 

With the Dropbox example, it prints:

['11/11/2015, 3:16 pm - IK: 12\n', 
'13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n', 
'13/11/2015, 12:11 pm - IK: Boo\n', 
'15/11/2015, 8:36 pm - IR: Root\n', 
'15/11/2015, 8:36 pm - IR: LaTeX?\n', 
'15/11/2015, 8:43 pm - IK: Ws\n'] 

If you want to remove the \n characters from the captured content, just use m.group(1).strip().replace('\n', '') in the list comprehension above.
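Note that replace('\n', '') joins the continuation lines without any separator ('abcdefghe'). If the space-separated form requested in the question is wanted instead, one option is to collapse every run of whitespace into a single space. A minimal sketch; `flatten` is a hypothetical helper name, not part of the answer:

```python
def flatten(record):
    """Collapse every run of whitespace (newlines included) into one space."""
    return ' '.join(record.split())

# One captured record, as produced by the list comprehension above
record = '07/01/2016, 6:14 pm - abcde\n\nfghe\n'
flat = flatten(record)
print(flat)
```

This also normalizes runs of spaces inside a message, which may or may not be desirable.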


Explanation of the regex:

^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^\d\d\/\d\d\/\d\d\d\d|\Z)

^                        start of line
\d\d\/\d\d\/\d\d\d\d     pattern for a date
.*?                      capture the rest...
(?=                      until (look ahead)
^\d\d\/\d\d\/\d\d\d\d    another date
|                        or
\Z                       end of string
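The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of an equivalent pattern (the escaped \/ is written as a plain /); the `DATE` and `sample` names are introduced here for illustration:

```python
import re

DATE = r'\d\d/\d\d/\d\d\d\d'

# re.S lets '.' match newlines, re.M lets '^' match at every line start,
# re.X (VERBOSE) allows whitespace and comments inside the pattern.
pat = re.compile(rf'''
    ^({DATE}.*?)      # a date at the start of a line, then lazily everything after it
    (?=^{DATE}|\Z)    # stop just before the next line starting with a date, or at the end
''', re.S | re.M | re.X)

sample = (
    '07/01/2016, 6:14 pm - abcde\n'
    '\n'
    'fghe\n'
    '07/01/2016, 6:20 pm - abcde\n'
)
records = [m.group(1) for m in pat.finditer(sample)]
print(records)
```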
+0

This works perfectly, thank you! Can you explain the code inside 're.compile'? – Imran

0

You can use a regular expression (via the re module) to check for a date, like this:

import re
with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        if re.match(r'\d{2}/\d{2}/\d{4}.*', row):
            data.append(row)  # date: new record
        else:
            data[-1] += '\n' + row  # no date: append to last record

# '\d{2}': two digits 
# '.*': any character, zero or more times 
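A message line that itself begins with a date would be mistaken for a new record. One way to reduce that risk, without eliminating it, is to match the whole header rather than just the date. The sketch below assumes the header layout seen in the sample ("dd/mm/yyyy, h:mm am|pm - "); `HEADER` and `group_records` are hypothetical names:

```python
import re

# Assumed header layout, taken from the sample data: "dd/mm/yyyy, h:mm am|pm - "
HEADER = re.compile(r'\d{2}/\d{2}/\d{4}, \d{1,2}:\d{2} [ap]m - ')

def group_records(lines):
    """Group lines into records; a line matching HEADER starts a new record."""
    records = []
    for line in lines:
        row = line.rstrip('\n')
        if HEADER.match(row):
            records.append(row)
        elif records:
            records[-1] += '\n' + row
        # lines before the first header are ignored
    return records

lines = [
    '07/01/2016, 6:14 pm - abcde\n',
    '\n',
    '07/01/2016 was a good day\n',  # starts with a date but is not a header
    '07/01/2016, 6:20 pm - abcde\n',
]
result = group_records(lines)
print(result)
```

Because the function only iterates over its argument, an open file object can be passed directly instead of a list.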
+0

As with the other approaches so far: this breaks if the data contains the delimiter sequence (a date in this format). – handle

1

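Assuming that ',' only appears as a separator, we may check whether a line contains a comma and, if it does not, concatenate it to the previous entry: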

data = []

with open('file.txt', 'r') as text_file:
    for line in text_file:
        row = line.strip()
        if ',' not in row:
            data[-1] += '\n' + row
        else:
            data.append(row)
+0

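So far, nothing prevents commas from appearing in the data (in fact, there are several in the data file linked in the question comments). Reliable separation is not possible. – handle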

+0

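When I posted, only the example in the question was available, and my code was meant as "the simplest thing that could possibly work". But you are right, with the data linked in the comments this does not work... –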

0

A simple test of the line length:

#!python3 
#coding=utf-8 

data = """06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde""" 

lines = data.split("\n") 
out = [] 
for l in lines:
    c = l.strip()
    if c:
        if len(c) < 10:
            out[-1] += c  # short line: continuation of the previous record
        else:
            out.append(c)
    # empty lines are skipped

for o in out:
    print(o)

Result:

06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcdefghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcdefgheijkl 
07/01/2016, 7:58 pm - abcde 

This does not preserve the line breaks in the data!


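But this one-liner regex should do it (split at a line break that is followed by a digit), at least for the sample data (it breaks when the data itself contains a line break followed by a digit):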

#!python3 
#coding=utf-8 

text_file = """06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde""" 

import re
data = re.split(r"\n(?=\d)", text_file)  # raw string, so \d is not treated as a string escape

print(data) 

for d in data: 
    print(d) 

Output:

['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde 

(Fixed by using a lookahead.)

+0

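This fails if the data contains a newline followed by a digit, so the regex would need to be extended. On the other hand, this approach [unsanitized data, no delimiters] breaks easily anyway if the data contains a new line that looks like a record header... – handle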

+0

What if one of the dates is '12/21/2016'? If you use 're.split(r'\n\d', txt)', your date becomes '2/21/2016'... – dawg

+0

Oops, I did not notice that it would consume the digit. – handle
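The difference between the two patterns discussed in these comments can be checked directly. A small sketch; the sample string and the variable names are made up for illustration:

```python
import re

txt = '06/01/2016, 10:40 pm - abcde\n12/21/2016, 6:14 pm - abcde'

# Without a lookahead, the digit is part of the separator and is discarded.
consumed = re.split(r'\n\d', txt)
# With a lookahead, only the newline is consumed and the digit stays in the record.
kept = re.split(r'\n(?=\d)', txt)

print(consumed[1])
print(kept[1])
```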