2015-08-17 91 views
1

Reformat a CSV file using Python and Pandas (AWK)?

I have a CSV file that looks like this (I don't care about the State column):

Names, Size, State, time1, time2,  
S1, 22, MD , 0.022, , 523.324 
S2, 22, MD , 4.32, , 342.54 
S3, 22, MD , 3.54, , 0.32 
S4, 22, MD , 4.32, , 0.54 
S1, 33, MD , 5.32, , 0.43 
S2, 33, MD , 11.54, , 0.65 
S3, 33, MD , 22.5, , 0.324 
S4, 33, MD , 45.89 , 0.32 
S1, 44, MD , 3.53 , 3.32 
S2, 44, MD , 4.5 , 0.322 
S3, 44, MD , 43.65 , 45.78 
S4, 44, MD, 43.54 , 0.321 

I need my output file to look like this:

Size , S1 , S2 , S3 , S4
22 , 0.022 , 4.32 , 45.89 , 4.32
33 , 5.32 , 11.54 , 22.5 , 45.89
44 , 3.53 , 4.5 , 43.65 , 43.54
3 values , 3 values , 3 values , 3 values

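As you can see, the output file has different headers, which are taken from the values in the first column (Names) of the input CSV file. The CSV file is sorted by the Size column. In other words, I want to know which time is associated with each size for each file (S1, S2, S3, S4). The column order also changes: the Size column comes first. The last row shows the total number of values in each column.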

My code so far:

import pandas as pd 
import numpy as np 
import csv 

df=pd.read_csv(r'C:\Users\testuser\Desktop\file.csv',usecols=[0,1,2,3,4]) 
df.columns=pd.MultiIndex.from_tuples(zip(['Names','FileSize','x','y','z'],df.columns)) # add column headers... (this did not do it correctly)
df_out=df.groupby('Names','FileSize').count().reset_index() # supposed to print distinct values
df_out.to_csv('processed_data_out.csv', columns['Names','FileSize','x','y','z'], header=False,index=False) 

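I know I'm not using the last column, time2, because I don't know how to add it in a way that lets the user see which times (both time1 and time2) are associated with each size.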
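For reference, the reshaping described above is what pandas calls a pivot. The following is only a minimal sketch under assumptions, not a tested fix for the code above: the `SAMPLE` text is a shortened copy of the input, the column name `extra` for the unnamed sixth field is made up for illustration, and only time1 is pivoted because time2 appears in different positions across the sample rows.

```python
import io

import pandas as pd

# Shortened copy of the question's input (assumption: every row has 6 fields).
SAMPLE = """\
Names, Size, State, time1, time2,
S1, 22, MD , 0.022, , 523.324
S2, 22, MD , 4.32, , 342.54
S1, 33, MD , 5.32, , 0.43
S2, 33, MD , 11.54, , 0.65
"""

# Read everything as text first, because the values carry stray spaces.
# "extra" is a hypothetical name for the unnamed trailing column.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=0,
    names=["Names", "Size", "State", "time1", "time2", "extra"],
    dtype=str,
)

# Strip the padding, then convert the columns that are needed.
df["Names"] = df["Names"].str.strip()
df["Size"] = df["Size"].str.strip().astype(int)
df["time1"] = df["time1"].str.strip().astype(float)

# One row per Size, one column per name, time1 as the cell value.
table = df.pivot(index="Size", columns="Names", values="time1")

# Number of non-missing values in each column, appended as a final row.
counts = table.count()
summary = pd.concat([table, counts.to_frame("count").T])

print(summary)
```

Calling `summary.to_csv('processed_data_out.csv')` would then write the result; the State column never enters the pivot, so it is dropped automatically.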

Answers

2

awk isn't necessary here. Since you are already using Python, I would stay with Python:

convert.py:

import csv
import sys

filename = sys.argv[1]

# Collect the time1 values for each Size, in the order they appear
data = {}
with open(filename, newline='') as csvfile:
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the headers
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or truncated lines
        size = int(row[1])
        time1 = float(row[3])
        data.setdefault(size, []).append(time1)

# Write one row per Size, smallest first
writer = csv.writer(sys.stdout, lineterminator="\n")
writer.writerow(["Size", "S1", "S2", "S3", "S4"])
for size in sorted(data):
    writer.writerow([size] + data[size])

Call it like this:

python convert.py C:\Users\testuser\Desktop\file.csv

Output:

Size,S1,S2,S3,S4
22,0.022,4.32,3.54,4.32
33,5.32,11.54,22.5,45.89
44,3.53,4.5,43.65,43.54
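The script above relies on every Size having its rows in S1..S4 order, and it does not produce the final row of counts the question asks for. A possible variation, sketched here under assumptions (the helper names `pivot_time1` and `write_pivot` and the shortened `SAMPLE` are made up for illustration), keys each value by its name and appends a count row:

```python
import csv
import io


def pivot_time1(lines):
    """Return (names, table) where table[size][name] holds time1."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    names = []
    table = {}
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or truncated lines
        name, size, time1 = row[0].strip(), int(row[1]), float(row[3])
        if name not in names:
            names.append(name)  # keep names in first-seen order
        table.setdefault(size, {})[name] = time1
    return names, table


def write_pivot(names, table, out):
    """Write one row per size plus a final row of per-column counts."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Size"] + names)
    for size in sorted(table):
        writer.writerow([size] + [table[size].get(n, "") for n in names])
    counts = [sum(1 for size in table if n in table[size]) for n in names]
    writer.writerow(["count"] + counts)


# Shortened copy of the question's input, used only for this sketch.
SAMPLE = """\
Names, Size, State, time1, time2,
S1, 22, MD , 0.022, , 523.324
S2, 22, MD , 4.32, , 342.54
S1, 33, MD , 5.32, , 0.43
S2, 33, MD , 11.54, , 0.65
"""

names, table = pivot_time1(io.StringIO(SAMPLE))
buf = io.StringIO()
write_pivot(names, table, buf)
result = buf.getvalue()
print(result)
```

Because cells are looked up by name, a size with a missing file simply gets an empty cell instead of shifting the remaining values one column to the left.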

By the way, an awk solution would look like this:

awk -F'[, ]+' '
    NR>1{
        a[$2]=a[$2]","$4
    }
    END{
        for(i in a){
            print i""a[i]
        }
    }' input.csv
+0

This only prints the Size column, without any duplicates. – royalblue

+0

I can't reproduce that. Is your input file called 'input.csv' (or did you change that)? – hek2mgl

+0

I've changed the code; you can now pass the file name on the command line. – hek2mgl

0

AWK to the rescue

awk -F, -f table.awk 

where

$ cat table.awk 

    NR == 1 { 
      h = $1   # save header 
      next 
    } 

    NR == 2 { 
      p = $2   # to match blocks 
      v = $2   # value accumulator 
    } 

    p == $2 {    # we're in the same block 
      v = v FS $4  # start accumulate values 
      if (h != "") { # if we're not done with header 
        h = h FS $1 # accumulate header values 
      } 
    } 

    p != $2 {    # we're in a new block 
      if (h != "") { # if not printed yet, print header 
        print h 
        h = "" # and reset 
      } 
      print v   # print values 
      p = $2   # set new block indicator 
      v = $2 FS $4  # accumulate values 
    } 

    END { 
      print v   # for the final block print values 
    } 

Test

awk -F, -f table.awk << ! 
> Names, Size, State, time1, time2, 
> S1, 22, MD , 0.022, , 523.324 
> S2, 22, MD , 4.32, , 342.54 
> S3, 22, MD , 3.54, , 0.32 
> S4, 22, MD , 4.32, , 0.54 
> S1, 33, MD , 5.32, , 0.43 
> S2, 33, MD , 11.54, , 0.65 
> S3, 33, MD , 22.5, , 0.324 
> S4, 33, MD , 45.89 , 0.32 
> S1, 44, MD , 3.53 , 3.32 
> S2, 44, MD , 4.5 , 0.322 
> S3, 44, MD , 43.65 , 45.78 
> S4, 44, MD, 43.54 , 0.321 
> ! 
Names,S1,S2,S3,S4 
22, 0.022, 4.32, 3.54, 4.32 
33, 5.32, 11.54, 22.5, 45.89 
44, 3.53 , 4.5 , 43.65 , 43.54 
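The script depends on the input being sorted by Size, so that each size forms one consecutive block. The same block-by-block idea can be sketched in Python with `itertools.groupby`; this is an illustration under assumptions (the function name `rows_by_size` and the shortened `SAMPLE` are made up), not a translation of the awk code:

```python
import csv
import io
from itertools import groupby


def rows_by_size(lines):
    """Yield [size, time1, time1, ...] for each run of equal Size values."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    records = (row for row in reader if len(row) >= 4)
    # groupby only merges adjacent rows, so the input must be sorted by Size.
    for size, block in groupby(records, key=lambda row: row[1].strip()):
        yield [size] + [row[3].strip() for row in block]


# Shortened copy of the question's input, used only for this sketch.
SAMPLE = """\
Names, Size, State, time1, time2,
S1, 22, MD , 0.022, , 523.324
S2, 22, MD , 4.32, , 342.54
S1, 33, MD , 5.32, , 0.43
S2, 33, MD , 11.54, , 0.65
"""

blocks = list(rows_by_size(io.StringIO(SAMPLE)))
for block in blocks:
    print(",".join(block))
```

Like the awk version, this never holds more than one block in memory, at the cost of giving wrong groups if the file is not sorted by Size.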
+0

This prints the State column. – royalblue

+0

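I don't think so. I just added test output based on your input data. Note that the code that prints never references the third field. Perhaps your sample input and your test input are different? – karakfa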

+0

@karakfa What about [this](http://stackoverflow.com/questions/32042568/reformat-csv-file-using-python-and-pandas-awk/32058235?noredirect=1#comment52018403_32058235)? – hek2mgl