Chunking data blocks with Python

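Hi everyone, I have a large file in the format shown below. The data is organized in "blocks". One block consists of three lines: time T, user U, and content W. For example, this is one block: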

T 2009-06-11 21:57:23 
U tracygazzard 
W David Letterman is good man

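Because I will only use the blocks that contain a specific keyword, I want to slice the data out of the huge original file block by block instead of loading all of it into memory: read one block at a time and, if its content line contains the word "bike", write that block to disk.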

You can use the following two blocks to test your script.

T 2009-06-11 21:57:23 
U tracygazzard 
W David Letterman is good man 

T 2009-06-11 21:57:23 
U charilie 
W i want a bike

I tried doing the work line by line:

data = open("OWS.txt", 'r')
output = open("result.txt", 'w')

for line in data:
    if line.find("bike") != -1:
        output.write(line)


2012-05-05 Frank Wang

Thanks, I have tried: for line in data: if line.find("bike") != -1: output.write(line)

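So I can handle the problem line by line, but I don't know how to do it block by block. You don't need to provide all the code, just the key part.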

Do the lines in every block start with 'T', 'U' and 'W'?

Since the format of your blocks is fixed, you can use a list to hold one block and then check whether "bike" is in that block:

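data = open("OWS.txt", 'r')
output = open("result.txt", 'w')

chunk = []  # lines of the block currently being read
for line in data:
    chunk.append(line)
    if line[0] == 'W':  # the W line is the last line of a block
        if 'bike' in str(chunk):
            for chunk_line in chunk:
                output.write(chunk_line)
        chunk = []

data.close()
output.close()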
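One thing to watch with the snippet above: `'bike' in str(chunk)` also matches when "bike" appears in the T or U line (a user called "biker", say). A minimal sketch of a variant that tests only the W line is below. The names `matching_blocks` and `sample` are made up for illustration, and unlike the code above it drops the blank separator lines instead of copying them.

```python
def matching_blocks(lines, keyword):
    """Yield each T/U/W block (a list of lines) whose W line contains keyword."""
    block = []
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines between blocks
        block.append(line)
        if line.startswith("W"):
            # The W line closes a block; test only its content.
            if keyword in line[1:]:
                yield block
            block = []


# The two test blocks from the question.
sample = [
    "T 2009-06-11 21:57:23\n",
    "U tracygazzard\n",
    "W David Letterman is good man\n",
    "\n",
    "T 2009-06-11 21:57:23\n",
    "U charilie\n",
    "W i want a bike\n",
]
result = list(matching_blocks(sample, "bike"))
for block in result:
    print("".join(block), end="")
```

Any iterable of lines works, so an open file object can be passed in place of `sample`; only one block is held in memory at a time.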


2012-05-05 10:20:13 fraxel

Nice idea.

You can use a regular expression:

import re

data = open("OWS.txt", 'r').read()  # Read the entire file into a string
output = open("result.txt", 'w')

for match in re.finditer(
        r"""(?mx)             # Verbose regex; ^ matches at the start of each line
        ^T\s+(?P<T>.*)\s*     # Match first line
        ^U\s+(?P<U>.*)\s*     # Match second line
        ^W\s+(?P<W>.*)\s*     # Match third line""",
        data):
    if "bike" in match.group("W"):
        output.write(match.group())  # Outputs the entire match

output.close()
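To see what the pattern captures, here is a small sketch that runs the same regex over the two test blocks from the question, held in a string instead of a file. `BLOCK_RE`, `sample` and `users` are illustrative names only.

```python
import re

BLOCK_RE = re.compile(
    r"""(?mx)            # verbose mode; ^ matches at the start of each line
    ^T\s+(?P<T>.*)\s*    # time line
    ^U\s+(?P<U>.*)\s*    # user line
    ^W\s+(?P<W>.*)\s*    # content line
    """
)

# The two test blocks from the question, including their trailing spaces.
sample = (
    "T 2009-06-11 21:57:23 \n"
    "U tracygazzard \n"
    "W David Letterman is good man \n"
    "\n"
    "T 2009-06-11 21:57:23 \n"
    "U charilie \n"
    "W i want a bike\n"
)

# Collect the user of every block whose content line mentions "bike".
users = [
    m.group("U").strip()
    for m in BLOCK_RE.finditer(sample)
    if "bike" in m.group("W")
]
print(users)
```

The named groups keep any trailing spaces from the line, hence the `strip()`. The final `\s*` of each match also swallows the blank line that follows a block.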


2012-05-05 08:08:52

Have you considered the memory issue? "# Read the entire file into a string"

@FrankWANG: Well, how big is your file?

It is 26 GB, but I can split it into smaller pieces.
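For a file that size, one option that avoids both `read()` and manual splitting is to stream it, since iterating over a file object yields one line at a time. A rough sketch using `itertools.groupby`, assuming blocks are separated by blank lines as in the test data; `iter_blocks` and `copy_matching` are hypothetical helper names, and `io.StringIO` stands in for the real files here.

```python
import io
from itertools import groupby


def iter_blocks(lines):
    """Yield blocks (lists of lines) separated by blank lines, one at a time."""
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            yield list(group)


def copy_matching(src, dst, keyword):
    """Write every block whose W line contains keyword to dst; return the count."""
    written = 0
    for block in iter_blocks(src):
        if any(line.startswith("W") and keyword in line[1:] for line in block):
            dst.writelines(block)
            dst.write("\n")  # blank line after each copied block
            written += 1
    return written


# In-memory stand-ins for the input and output files.
src = io.StringIO(
    "T 2009-06-11 21:57:23 \n"
    "U tracygazzard \n"
    "W David Letterman is good man \n"
    "\n"
    "T 2009-06-11 21:57:23 \n"
    "U charilie \n"
    "W i want a bike\n"
)
dst = io.StringIO()
count = copy_matching(src, dst, "bike")
print(count)
print(dst.getvalue(), end="")
```

With real files, `src` and `dst` would be the objects returned by `open("OWS.txt")` and `open("result.txt", "w")`. Memory use then depends on the size of one block, not on the size of the file.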
