2017-03-08

Create a CSV with a header from a log file in Python

My log file contains lines like the following:

Info1:NewOrder|key:123 |Info3:10|Info5:abc 
Info3:10|Info1:OldOrder| key:456| Info6:xyz 
Info1:NewOrder|key:007 

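I want to convert it to a CSV like the one below, keeping only some of the information from each line (here the required header is key, Info1, Info3):

key,Info1,Info3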
123,NewOrder,10 
456,OldOrder,10 
007,NewOrder, 

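Earlier I used awk to extract the field values, but the logging can change the order in which the info fields and the key are printed on a line. So I cannot be sure that Info3 will always be in a particular column, and the script has to change every time the logging changes.

I then plan to load the CSV into a pandas DataFrame, so a Python solution would be preferable. This is mostly a data-cleaning task: generating a CSV from a log file.

This is what I have so far: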

import csv
import sys

# Read the whole log, keeping newlines so it can be split into rows later
with open(sys.argv[1], 'r') as myLogfile:
    log = myLogfile.read()

requested_columns = ["OrderID", "TimeStamp", "ErrorCode"]

def wrangle(string, requested_columns):
    # One dict per non-empty line; split each element on the first ":" only
    data = [dict(element.strip().split(":", 1) for element in row.split("|") if ":" in element)
            for row in string.splitlines() if row.strip()]
    body = [[row.get(column, "") for column in requested_columns] for row in data]
    return [requested_columns] + body

outpath = sys.argv[2]
# Python 3; on Python 2 use open(outpath, 'wb') instead
with open(outpath, "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(wrangle(log, requested_columns))

Sample log file: https://ideone.com/cny805

Answers

0

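You can use a CSV reader with a | delimiter to get started, then split each field on : to build a dictionary for each row. For the sample log, this gives: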

OrderID,TimeStamp,ErrorCode 
3000000,1488948188555841641, 
3000000,1488948188556444675,0 
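
A related sketch, not taken from the original answer: `csv.DictWriter` can handle the column selection itself. With `extrasaction="ignore"` it drops keys that were not requested, and `restval=""` fills in the missing ones. The helper name `log_to_csv` and the in-memory buffer are assumptions made for illustration; the sample lines are the ones from the question.

```python
import csv
import io


def log_to_csv(lines, cols, out):
    """Write the requested columns of pipe-delimited key:value lines as CSV."""
    writer = csv.DictWriter(out, fieldnames=cols, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for line in lines:
        entries = {}
        for field in line.split("|"):
            # Split on the first colon only, so values may contain colons
            key, sep, value = field.partition(":")
            if sep:  # skip fragments without a colon
                entries[key.strip()] = value.strip()
        if entries:  # skip blank lines
            writer.writerow(entries)


sample = [
    "Info1:NewOrder|key:123 |Info3:10|Info5:abc",
    "Info3:10|Info1:OldOrder| key:456| Info6:xyz",
    "Info1:NewOrder|key:007",
]
buffer = io.StringIO()
log_to_csv(sample, ["key", "Info1", "Info3"], buffer)
print(buffer.getvalue())
```

Because each line becomes a dictionary before it is written, the order of the fields within a log line no longer matters, which is the problem the awk script ran into.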

To read the data directly into a pandas DataFrame:

import pandas as pd
import csv

cols = ["OrderID", "TimeStamp", "ErrorCode"]
data = []

# Python 3; on Python 2 open the file with mode 'rb' instead
with open('input.csv', newline='') as f_input:
    for row in csv.reader(f_input, delimiter='|'):
        # Remove any entries that do not have a colon
        row = [c for c in row if ':' in c]
        # Convert remaining columns into a dictionary, splitting on the first colon only
        entries = {k.strip(): v.strip() for k, v in (c.split(':', 1) for c in row)}
        data.append([entries.get(c, "") for c in cols])

df = pd.DataFrame(data, columns=cols)
print(df)

Giving you:

   OrderID            TimeStamp ErrorCode
0  3000000  1488948188555841641
1  3000000  1488948188556444675         0

Thanks, but I get 'TypeError: argument 1 must be an iterator' – pythonRcpp

Which version of Python are you using? And on which line do you get the error? I have tested this on Python 2.7.6 and Python 3.5.2. –

Python 2.7.5, with a two-line sample log file (the separator is : rather than =): https://ideone.com/cny805. The required columns can be OrderID, Timestamp, ErrorCode – pythonRcpp

0

Most of this comes down to applying useful string methods such as strip and split, plus list comprehensions.

import csv

# Python 3; on Python 2 open the files with modes 'rb' and 'wb' instead
with open('input.csv', newline='') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    cols = ["OrderID", "TimeStamp", "ErrorCode"]
    csv_output.writerow(cols)

    for row in csv.reader(f_input, delimiter='|'):
        # Remove any entries that do not have a colon
        row = [c for c in row if ':' in c]
        # Convert remaining columns into a dictionary, splitting on the first colon only
        entries = {k.strip(): v.strip() for k, v in (c.split(':', 1) for c in row)}
        csv_output.writerow([entries.get(c, "") for c in cols])

This gives you an output file:

import csv

string = """Info1=NewOrder|key=123 |Info3=10|Info5=abc
Info3=10|Info1=OldOrder| key=456| Info6=xyz
Info1=NewOrder|key=007"""

requested_columns = ["key", "Info1", "Info3"]

def wrangle(string, requested_columns):
    # One dict per line; split each element on the first "=" only
    data = [dict([element.strip().split("=", 1) for element in row.split("|")]) for row in string.split("\n")]
    # Pick the requested columns in order, leaving missing ones empty
    body = [[row.get(column, "") for column in requested_columns] for row in data]
    return [requested_columns] + body

outpath = "whatever.csv"

# Python 3; the built-in open of Python 2 has no newline argument
with open(outpath, "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(wrangle(string, requested_columns))

Thanks @Denziloe, could you please elaborate on the wrangle function? I am still a learner. – pythonRcpp

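No problem. It uses what are called comprehensions. You can learn about them here: https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions. You should also try adding print statements to print data and body, which will show very clearly how they work. If the code works, please accept the answer. ;) – Denziloe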
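
To make that suggestion concrete, here is a hedged sketch of the same wrangle logic with the two comprehensions unrolled into plain loops and the print calls added. The name `wrangle_verbose` is made up for this illustration, and it splits each element on the first `=` only.

```python
def wrangle_verbose(string, requested_columns):
    # First comprehension: one dict per log line
    data = []
    for row in string.split("\n"):
        pairs = []
        for element in row.split("|"):
            # "key=123 " -> ["key", "123"]
            pairs.append(element.strip().split("=", 1))
        data.append(dict(pairs))
    print("data:", data)

    # Second comprehension: pick the requested columns, in order
    body = []
    for row in data:
        body.append([row.get(column) for column in requested_columns])
    print("body:", body)

    return [requested_columns] + body


string = """Info1=NewOrder|key=123 |Info3=10|Info5=abc
Info3=10|Info1=OldOrder| key=456| Info6=xyz
Info1=NewOrder|key=007"""

rows = wrangle_verbose(string, ["key", "Info1", "Info3"])
```

The printed `data` list shows one dictionary per line regardless of field order, and `body` shows `None` wherever a requested column is absent from a line.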

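TypeError: 'newline' is an invalid keyword argument for this function. File "", line 3, in File "", line 2, in wrangle ValueError: dictionary update sequence element #0 has length 3; 2 is required – pythonRcpp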
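
Both errors can be reproduced and avoided in a few lines. The sketch below assumes, without having seen the full log, that some field contains a second `:`; the field `Time:12:30` is invented for illustration. That is what would make `split(":")` return three parts. The helper `open_csv_for_writing` is likewise hypothetical. It reflects that the built-in `open` of Python 2 has no `newline` argument and that the Python 2 csv module expects files opened in binary mode.

```python
import csv
import os
import sys
import tempfile

element = "Time:12:30"  # hypothetical field whose value contains the separator

print(element.split(":"))     # three parts, so dict() rejects it
print(element.split(":", 1))  # two parts: key and the full value

try:
    dict([element.split(":")])
except ValueError as error:
    print("ValueError:", error)

pair = dict([element.split(":", 1)])


def open_csv_for_writing(path):
    """Open a file for csv.writer on either major Python version."""
    if sys.version_info[0] >= 3:
        return open(path, "w", newline="")
    return open(path, "wb")  # Python 2: no newline argument, binary mode


path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open_csv_for_writing(path) as handle:
    csv.writer(handle).writerow(["Time", pair["Time"]])
```

Limiting the split to the first separator keeps the rest of the value intact, so each element always yields exactly one key and one value.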