2017-02-13 24 views
0

Reading a complex table ('task-spooler') with pandas

I have the following table, which is the output of task-spooler.

It is easy for a human to parse, but I am having a hard time reading it into a pandas DataFrame.

Any ideas?

ID State  Output    E-Level Times(r/u/s) Command [run=1/2] 
6 running /tmp/ts-out.FzVneG       [l1]python infloop.py 
0 finished /tmp/ts-out.ixWHm2 0  0.00/0.00/0.00 bash -c echo 1 
1 finished /tmp/ts-out.ZzwS11 0  0.00/0.00/0.00 bash -c echo 1 
2 finished /tmp/ts-out.GJlyge 2  0.00/0.00/0.00 bash -c 
4 finished /tmp/ts-out.lIVMYH 2  0.00/0.00/0.00 bash -c -h 
5 finished /tmp/ts-out.8EKHy1 -1  141.23/0.00/0.00 python infloop.py 
3 finished /tmp/ts-out.lBr4Wy -1  2545.36/0.00/0.02 bash -c python infloop.py 
7 finished /tmp/ts-out.kxCczi 2  0.01/0.00/0.00 bash -c 
8 finished /tmp/ts-out.3VkfNh 0  0.00/0.00/0.00 echo 
9 finished /tmp/ts-out.8ewxzl 0  0.01/0.00/0.00 echo 
10 finished /tmp/ts-out.ahSLaY 0  0.00/0.00/0.00 bash -c echo $GPUID 
11 finished /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0  0.00/0.00/0.00 bash -c ls 
12 finished /tmp/ts-out.ADWkve 0  0.00/0.00/0.00 bash -c ls 
13 finished /a/home/cc/cs/yuvval/tmp/ts-out.xm0jtn -1  130.67/0.00/0.02 bash -c python infloop.py 
14 finished /tmp/ts-out.HxBqkm 0  0.00/0.00/0.00 bash -c echo 11 
15 finished /tmp/ts-out.ERNuaE 0  0.00/0.00/0.00 bash -c echo 
16 finished /tmp/ts-out.9j6hkS 0  0.00/0.00/0.00 bash -c echo $GPUID 
17 finished /tmp/ts-out.Y5QDNa 0  0.00/0.00/0.00 bash -c echo $GPUID 
18 finished /tmp/ts-out.EIHhoX -1  0.00/0.00/0.00 %s 
19 finished /tmp/ts-out.LLw2Wl -1  0.00/0.00/0.00 
20 finished /tmp/ts-out.deWAJR -1  0.01/0.00/0.00 echo $GPUID 
21 finished /tmp/ts-out.AdZFIf -1  0.00/0.00/0.00 echo 12 
22 finished /tmp/ts-out.NBOCVv 0  0.00/0.00/0.00 echo 12 
23 finished /tmp/ts-out.5WpfPu 0  0.00/0.00/0.00 echo 
24 finished /tmp/ts-out.1lw4bS -1  0.00/0.00/0.00 echo 
25 finished /tmp/ts-out.7MNGLQ 0  0.00/0.00/0.00 bash -c echo $GPUID 
26 finished /tmp/ts-out.8FZ3on 0  0.00/0.00/0.00 bash -c echo $GPUID 

My best attempt was:

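import pandas as pd
from StringIO import StringIO as sIO  # Python 2; on Python 3 use: from io import StringIO as sIO
std = ...  # the table text
pd.read_table(sIO(std), sep=r'\s+', engine='python')  # raises the ValueError shown next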

Error:

ValueError: Expected 7 fields in line 2, saw 9 
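
The field counts in this error can be reproduced by splitting a few rows on whitespace, which is roughly what sep='\s+' asks for. A minimal sketch, using three lines copied from the table above (the variable names are only illustrative):

```python
# Three lines copied verbatim from the task-spooler table in the question.
lines = [
    "ID State  Output    E-Level Times(r/u/s) Command [run=1/2]",
    "6 running /tmp/ts-out.FzVneG       [l1]python infloop.py",
    "0 finished /tmp/ts-out.ixWHm2 0  0.00/0.00/0.00 bash -c echo 1",
]

# Split each line on runs of whitespace and count the resulting tokens.
field_counts = [len(line.split()) for line in lines]

for count, line in zip(field_counts, lines):
    print(count, "|", line)
```

The header splits into 7 tokens and the first finished row into 9, because the command 'bash -c echo 1' itself contains spaces. The running row has fewer tokens than the header, since it has no E-Level or Times values.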

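Edit: The source code that produces the table is available. Here is the code that generates each row. Could this help in reading the table into a DataFrame?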

if (p->label) 
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s[%s]" 
      "%s\n", 
      p->jobid, 
      jobstate, 
      output_filename, 
      p->result.errorlevel, 
      p->result.real_ms, 
      p->result.user_ms, 
      p->result.system_ms, 
      dependstr, 
      p->label, 
      p->command); 
else 
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s\n", 
      p->jobid, 
      jobstate, 
      output_filename, 
      p->result.errorlevel, 
      p->result.real_ms, 
      p->result.user_ms, 
      p->result.system_ms, 
      dependstr, 
      p->command); 
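
One way the format string above could be put to use is a regular expression that mirrors its fields. This is only a sketch under assumptions: the pattern ROW and the helper parse_ts are hypothetical names, the output path is assumed to contain no spaces, the dependency string is assumed to be empty, and rows without an error level and times (such as the running job) are assumed to go straight from the output path to the command. Any label such as [l1] is left as part of the command.

```python
import re

# Hypothetical pattern mirroring the snprintf fields: job id, state, output
# file, then (optionally) error level and real/user/system times, then the
# rest of the line as the command.
ROW = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<state>\S+)\s+"
    r"(?P<output>\S+)\s+"
    r"(?:(?P<elevel>-?\d+)\s+"
    r"(?P<real>\d+\.\d+)/(?P<user>\d+\.\d+)/(?P<system>\d+\.\d+)\s?)?"
    r"(?P<command>.*)$"
)


def parse_ts(text):
    """Return one dict per job row; lines that do not match (e.g. the header) are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.rstrip())
        if match is None:
            continue
        row = match.groupdict()
        row["id"] = int(row["id"])
        if row["elevel"] is not None:
            # Finished jobs carry an error level and three timings.
            row["elevel"] = int(row["elevel"])
            for key in ("real", "user", "system"):
                row[key] = float(row[key])
        rows.append(row)
    return rows


# A few lines copied from the table in the question.
sample = """\
ID State  Output    E-Level Times(r/u/s) Command [run=1/2]
6 running /tmp/ts-out.FzVneG       [l1]python infloop.py
3 finished /tmp/ts-out.lBr4Wy -1  2545.36/0.00/0.02 bash -c python infloop.py
11 finished /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0  0.00/0.00/0.00 bash -c ls
19 finished /tmp/ts-out.LLw2Wl -1  0.00/0.00/0.00
"""

rows = parse_ts(sample)
for row in rows:
    print(row)
```

If pandas is available, passing the list to pd.DataFrame(rows) should give a DataFrame with one column per named group, with missing values for the running job.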
+0

Is this tab-delimited? Try 'sep='\t'' – EdChum

+0

@EdChum, no. Using '\t' puts all the columns into a single column – yuval

+1

'df = pd.read_csv('file', sep=r'\s{2,}', engine='python')'? The separator is a regular expression meaning two or more whitespace characters – jezrael

Answers

0

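This is annoying: because the separator in the output log is inconsistent (sometimes multiple spaces, sometimes tabs, and before the last column usually just a single space), it is hard to parse the file with pandas without applying some extra logic first.

Personally, I don't like opening the file in Python to fix it and then loading it with pandas, so I would simply add a short sed command to my pipeline and then load the file in Python (this is very easy if you are on Linux and the log text is loaded from a file). You can add: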

cat logfile.log | sed -r 's/\s\s+/,/g' | sed -e 's/\([[:digit:]]\.[[:digit:]]\{2\}\) /\1,/' > logfile.csv 

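This replaces every run of two or more whitespace characters with a comma, and then replaces the last, problematic single space (the one after the times) as well. The text is then converted from this: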

ID State  Output    E-Level Times(r/u/s) Command [run=1/2] 
6 running /tmp/ts-out.FzVneG       [l1]python infloop.py 
0 finished /tmp/ts-out.ixWHm2 0  0.00/0.00/0.00 bash -c echo 1 
1 finished /tmp/ts-out.ZzwS11 0  0.00/0.00/0.00 bash -c echo 1 
2 finished /tmp/ts-out.GJlyge 2  0.00/0.00/0.00 bash -c 
4 finished /tmp/ts-out.lIVMYH 2  0.00/0.00/0.00 bash -c -h 
5 finished /tmp/ts-out.8EKHy1 -1  141.23/0.00/0.00 python infloop.py 
3 finished /tmp/ts-out.lBr4Wy -1  2545.36/0.00/0.02 bash -c python infloop.py 
7 finished /tmp/ts-out.kxCczi 2  0.01/0.00/0.00 bash -c 
8 finished /tmp/ts-out.3VkfNh 0  0.00/0.00/0.00 echo 

to this:

ID,State,Output,E-Level,Times(r/u/s),Command [run=1/2] 
6,running,/tmp/ts-out.FzVneG,[l1]python infloop.py 
0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1 
1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1 
2,finished,/tmp/ts-out.GJlyge,2,0.00/0.00/0.00,bash -c 
4,finished,/tmp/ts-out.lIVMYH,2,0.00/0.00/0.00,bash -c -h 
5,finished,/tmp/ts-out.8EKHy1,-1,141.23/0.00/0.00,python infloop.py 
3,finished,/tmp/ts-out.lBr4Wy,-1,2545.36/0.00/0.02,bash -c python infloop.py 
7,finished,/tmp/ts-out.kxCczi,2,0.01/0.00/0.00,bash -c 
8,finished,/tmp/ts-out.3VkfNh,0,0.00/0.00/0.00,echo 

Then load it into pandas as a CSV:

import pandas as pd
my_log_file = 'logfile.csv'  # the file written by the sed pipeline
my_df = pd.read_csv(my_log_file)

I'm sorry this isn't a fun pure-Python solution, but in my opinion the bash part makes the Python part much easier.
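
For readers who want to check what the two substitutions do without leaving Python, here is a rough sketch of the same two steps with the re module. It rests on assumptions: ts_row and to_csv_line are hypothetical helpers, ts_row rebuilds a finished-job row from the snprintf format quoted in the question (with an empty dependency string), and the dot in the second pattern is escaped so that it only matches a literal decimal point.

```python
import re


def ts_row(jobid, state, output, elevel, real, user, system, command):
    """Rebuild a finished-job row using the padding from the quoted snprintf format."""
    return "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s" % (
        jobid, state, output, elevel, real, user, system, command)


def to_csv_line(line):
    # Step 1: every run of two or more whitespace characters becomes a comma.
    line = re.sub(r"\s\s+", ",", line.rstrip())
    # Step 2: the first "d.dd" that is followed by a single space (normally the
    # end of the Times field) gets that space replaced by a comma.
    return re.sub(r"(\d\.\d{2}) ", r"\1,", line, count=1)


short = to_csv_line(ts_row(0, "finished", "/tmp/ts-out.ixWHm2",
                           0, 0.0, 0.0, 0.0, "bash -c echo 1"))
long_ = to_csv_line(ts_row(11, "finished", "/a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO",
                           0, 0.0, 0.0, 0.0, "bash -c ls"))

print(short)
print(long_)
```

Under these assumptions the short-path row comes out with six comma-separated fields, but the long-path row keeps the output path and the error level in one field: a path longer than the 20-character padding is followed by only a single space. Such rows would need extra handling before loading the CSV.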

+1

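Thanks, I'm fine with 'sed' as long as it works. However, your solution still doesn't solve the inconsistent-spacing problem. For example, look at your second CSV row: ideally there should be an additional command before '[l1]python infloop.py' – yuval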

+0

I meant additional *commas – yuval

+1

Sorry, for some reason I thought this row was missing the last column, rather than two columns in the middle. –