2011-11-10 79 views
0

Minute averages of a CSV file

I have a large CSV file that logs a datetime and a value every 10 seconds. The file looks like this:

 
Datetime    Data 
2008-10-01 12:00:10, 34 
2008-10-01 12:00:20, 55 
2008-10-01 12:00:30, 46 
2008-10-01 12:00:40, 33 
2008-10-01 12:00:50, 55 
2008-10-01 12:01:00, 21 
2008-10-01 12:01:10, 2 
2008-10-01 12:01:20, 34 
2008-10-01 12:01:30, 521 
2008-10-01 12:01:40, 45 
2008-10-01 12:01:50, 32 
2008-10-01 12:02:00, 34 

I want to write a script that computes the per-minute averages and writes them to a new CSV file, giving the following output:

 
Datetime    Data 
2008-10-01 12:00:00, 40.67 
2008-10-01 12:01:00, 111.33 

Any ideas on how this could be done? Any suggestions about modules I should look at, or any examples, would be appreciated.

+0

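Which scripting languages do you know? This can be done in a great many languages, or even in Excel. Give us some guidance on your platform or preferred language and we can help you further. – dan360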

+0

@dan360 The question is tagged Python. – agf

+0

Why the downvote? I am learning Python and want to do this in Python; I asked which modules I should look at. – Navin

Answers

1

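Use csv.reader to parse the file and a dictionary to accumulate the results. The str.rpartition method can split off the seconds. Use sum and len to compute the average: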

data = '''\ 
2008-10-01 12:00:10, 34 
2008-10-01 12:00:20, 55 
2008-10-01 12:00:30, 46 
2008-10-01 12:00:40, 33 
2008-10-01 12:00:50, 55 
2008-10-01 12:01:00, 21 
2008-10-01 12:01:10, 2 
2008-10-01 12:01:20, 34 
2008-10-01 12:01:30, 521 
2008-10-01 12:01:40, 45 
2008-10-01 12:01:50, 32 
2008-10-01 12:02:00, 34 
'''.splitlines() 

import csv 

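# Map each minute ('YYYY-MM-DD HH:MM') to the list of values recorded in it.
d = {}
for timestamp, value in csv.reader(data):
    # rpartition splits at the last ':' and so drops the seconds.
    minute, colon, second = timestamp.rpartition(':')
    if minute not in d:
        d[minute] = [float(value)]
    else:
        d[minute].append(float(value))

# Sort by minute so the averages come out in chronological order.
for minute, values in sorted(d.items()):
    avg_value = sum(values) / len(values)
    print(minute + ',' + str(avg_value))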
+0

Why not 'defaultdict' or 'setdefault'? And/or why lose the order and then rebuild it, instead of using OrderedDict? – agf

+1

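Those would be my natural choices too, but this is a beginner question, so the answer calls for simple Python (plain dictionaries, string methods, type conversion, unformatted print, and a minimal number of steps per line). –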
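
For comparison, the defaultdict alternative raised in the comments above might look roughly like this. It is only a sketch, not the answerer's code: it assumes Python 3, rows shaped like the sample in the question, and a hypothetical helper name `minute_averages`.

```python
import csv
from collections import defaultdict


def minute_averages(lines):
    """Return (minute, average) pairs, grouping rows by 'YYYY-MM-DD HH:MM'.

    `minute_averages` is a hypothetical name used for illustration only.
    """
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for timestamp, value in csv.reader(lines):
        minute = timestamp.rpartition(':')[0]  # drop the ':SS' part
        groups[minute].append(float(value))
    # Sort explicitly rather than relying on dictionary ordering.
    return [(minute, sum(vals) / len(vals))
            for minute, vals in sorted(groups.items())]


sample = [
    '2008-10-01 12:00:10, 34',
    '2008-10-01 12:00:20, 55',
    '2008-10-01 12:01:00, 21',
]
print(minute_averages(sample))
# [('2008-10-01 12:00', 44.5), ('2008-10-01 12:01', 21.0)]
```

The defaultdict removes the `if minute not in d` branch; the rest of the logic is the same as in the plain-dictionary version.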

2

In my opinion, the simplest way is to treat the time as a string rather than as a time, and use itertools.groupby:

from csv import reader 
from itertools import groupby 

lines = """Datetime    Data 
2008-10-01 12:00:10, 34 
2008-10-01 12:00:20, 55 
2008-10-01 12:00:30, 46 
2008-10-01 12:00:40, 33 
2008-10-01 12:00:50, 55 
2008-10-01 12:01:00, 21 
2008-10-01 12:01:10, 2 
2008-10-01 12:01:20, 34 
2008-10-01 12:01:30, 521 
2008-10-01 12:01:40, 45 
2008-10-01 12:01:50, 32 
2008-10-01 12:02:00, 34""" 

lines = iter(lines.splitlines()) 

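# above this is just for testing, really you'd do
# with open('filename', newline='') as lines:
# and indent the rest

next(lines)  # skip the header line

# groupby merges consecutive rows that share a key, so the rows must be in
# time order; row[0][:16] is the 'YYYY-MM-DD HH:MM' part of the timestamp.
for minute, group in groupby(reader(lines), lambda row: row[0][:16]):
    group = list(group)
    print(minute, sum(float(row[1]) for row in group) / len(group))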
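

Neither snippet writes a new CSV file, and grouping on the minute text gives 44.6 for 12:00 on the sample data, not the 40.67 shown in the question. The question's numbers (40.67 and 111.33) come out if each reading is counted toward the 10-second interval that ends at its timestamp, so that 12:01:00 belongs to minute 12:00. That interpretation is an assumption inferred from the expected output, not something the question states. A minimal Python 3 sketch under that assumption, with hypothetical names `minute_start` and `write_minute_averages`:

```python
import csv
import io
from datetime import datetime, timedelta
from itertools import groupby

FMT = '%Y-%m-%d %H:%M:%S'


def minute_start(row, interval=timedelta(seconds=10)):
    """Return the start of the minute in which the row's sample interval began.

    Assumption: a reading stamped T covers the `interval` ending at T.
    """
    stamp = datetime.strptime(row[0].strip(), FMT) - interval
    return stamp.replace(second=0)


def write_minute_averages(infile, outfile):
    """Read 'timestamp, value' rows and write one averaged row per minute."""
    writer = csv.writer(outfile)
    # groupby only merges consecutive rows, so the input must be in time order.
    for minute, rows in groupby(csv.reader(infile), key=minute_start):
        values = [float(row[1]) for row in rows]
        average = sum(values) / len(values)
        writer.writerow([minute.strftime(FMT), '%.2f' % average])


# In-memory stand-ins for real files, using the data from the question.
source = io.StringIO(
    '2008-10-01 12:00:10, 34\n'
    '2008-10-01 12:00:20, 55\n'
    '2008-10-01 12:00:30, 46\n'
    '2008-10-01 12:00:40, 33\n'
    '2008-10-01 12:00:50, 55\n'
    '2008-10-01 12:01:00, 21\n'
    '2008-10-01 12:01:10, 2\n'
    '2008-10-01 12:01:20, 34\n'
    '2008-10-01 12:01:30, 521\n'
    '2008-10-01 12:01:40, 45\n'
    '2008-10-01 12:01:50, 32\n'
    '2008-10-01 12:02:00, 34\n'
)
result = io.StringIO()
write_minute_averages(source, result)
print(result.getvalue())
```

With real files, the two StringIO objects would be replaced by something like `open('input.csv', newline='')` and `open('output.csv', 'w', newline='')` (both file names are placeholders), and a header line, if present, would need to be skipped with `next(infile)` first.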