2014-01-09 69 views
3

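Extract data from an oddly arranged CSV and convert it to another CSV file using Python

I have an odd CSV file that holds header names and their corresponding values, arranged as follows: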

,,,Completed Milling Job,,,,,, # row 1
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,

I need to extract the data from this structure and create another CSV file structured as follows:

Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header 
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row 

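Note that a new header column called "Status" has been added, but its value comes from the first row of the original CSV file. The remaining column names in the output file are taken from the original file.

Any ideas would be greatly appreciated. Thanks.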

+0

The original file format is as follows: – user3130236

+0

Does the original file contain multiple jobs, or is there a separate file for each job? – mmdanziger

+0

There is a separate file for each job, so all I want to extract is one row per file. – user3130236

Answers

0

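Assuming the files are all exactly like that (at least in terms of capitalization), this should work, though I can only guarantee it on the exact data you supplied: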

#!/usr/bin/python
import glob
from sys import argv

# argv[1]: glob pattern matching the input files; argv[2]: output file name
with open(argv[2], 'w') as g:
    g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
    for fname in glob.glob(argv[1]):
        with open(fname) as f:
            # line 1: the status text, surrounded by empty cells
            status = f.readline().strip().strip(',')
            f.readline()  # "Extended Report" line, not needed
            f.readline()  # "Job Spec numerical control" line, not needed
            # each remaining line holds two key/value pairs
            s = f.readline()
            job_no = s.split('Job Number,')[1].split(',')[0]
            op_id = s.split('Operator Id,')[1].strip().strip(',')
            s = f.readline()
            machine_name = s.split('Coder Machine Name,')[1].split(',')[0]
            start_t = s.split('Job Start time,')[1].strip().strip(',')
            s = f.readline()
            machine_type = s.split('Machine type,')[1].split(',')[0]
            end_t = s.split('Job end time,')[1].strip().strip(',')
        # one output row per input file
        g.write(",".join([status, job_no, machine_name, machine_type, op_id, start_t, end_t]) + "\n")

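It takes a glob argument (such as Job*.data) and an output file name, and should build what you need. Just save it as 'so.py' or something similar, then run it as python so.py <data_files_wildcarded> output.csv. Because the script calls glob.glob itself, quote the pattern (for example 'Job*.data') on shells that expand wildcards, so that it reaches the script unexpanded.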
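If quoting the pattern is inconvenient, the argument handling could be made to accept either form. The following is only a sketch under that assumption: `split_args` is a hypothetical helper, not part of the script above, and it treats every argument except the last as a pattern or file name and the last one as the output file.

```python
import glob


def split_args(args):
    """Return (input_paths, output_path) from a list of command-line arguments.

    Hypothetical helper: every argument except the last is treated as a glob
    pattern or a plain file name; the last argument is the output file.
    """
    if len(args) < 2:
        raise ValueError("need at least one input pattern and an output file")
    *patterns, output_path = args
    input_paths = []
    for pattern in patterns:
        # For a plain name without wildcard characters, glob.glob returns a
        # one-item list if the file exists, so shell-expanded file lists and
        # quoted patterns end up being handled the same way.
        input_paths.extend(sorted(glob.glob(pattern)))
    return input_paths, output_path
```

In the script this would be called as split_args(argv[1:]), and the loop would then run over the returned input paths instead of glob.glob(argv[1]).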

+0

Thanks a lot. I'll try this code. – user3130236

+0

If you find the answer useful, please upvote it and/or accept it by clicking the checkmark. – mmdanziger

+0

Sure, absolutely. Thanks. – user3130236

0

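The following solution works for any CSV file that follows the same pattern as the one shown. That is a seriously nasty format.

I got interested in the problem and worked on it during my lunch break. Here's the code: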

COMMA = ',' 
NEWLINE = '\n' 

def _kvpairs_from_line(line): 
    line = line.strip() 
    values = [item.strip() for item in line.split(COMMA)] 

    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values; a key in the very last cell gets an empty value
            value = values[i + 1] if i + 1 < len(values) else ''
            yield (values[i], value)
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines): 
    """ 
    Given a series of lines, where each line is comma-separated values 
    organized as key/value pairs like so: 
     key_1,value_1,key_n+1,value_n+1,... 
     key_2,value_2,key_n+2,value_n+2,... 
     ... 
     key_n,value_n,key_n+n,value_n+n,... 

    Yield up key/value pairs taken from the first column, then from the second column 
    and so on. 
    """ 
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True

STATUS = "Status" 
columns = [STATUS] 

d = {} 

with open("data.csv", "rt") as f: 
    # get an iterator that lets us pull lines conveniently from file 
    itr = iter(f) 

    # pull first line and collect status 
    line = next(itr) 
    lst = line.split(COMMA) 
    d[STATUS] = lst[3] 

    # pull next lines and make sure the file is what we expected 
    line = next(itr) 
    assert "Extended Report" in line 
    line = next(itr) 
    assert "Job Spec numerical control" in line 

    # pull all remaining lines and save in a list 
    lines = [line.strip() for line in f] 

for key, value in kvpairs_by_column_then_row(lines): 
    columns.append(key) 
    d[key] = value 

with open("output.csv", "wt") as f: 
    # write column headers line 
    line = COMMA.join(columns) 
    f.write(line + NEWLINE) 
    # write data row 
    line = COMMA.join(d[key] for key in columns) 
    f.write(line + NEWLINE)
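
For comparison, here is a minimal sketch of the same column-then-row idea built on the standard csv module, so that any value containing a comma would be quoted on output. The names SAMPLE, pairs_in_row and flatten_report are illustrative and not part of the code above. The sketch assumes the layout from the question: one status row, two title rows, then rows of key/value pairs. It reads from a string rather than a file so it can be run as-is.

```python
import csv
import io

# Sample data copied from the question (without the "# row 1" annotation).
SAMPLE = """\
,,,Completed Milling Job,,,,,,
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
"""


def pairs_in_row(row):
    """Return the (key, value) pairs found in one parsed CSV row, left to right."""
    cells = [cell.strip() for cell in row]
    pairs = []
    i = 0
    while i < len(cells):
        if cells[i]:
            # A key in the very last cell gets an empty value.
            value = cells[i + 1] if i + 1 < len(cells) else ""
            pairs.append((cells[i], value))
            i += 2
        else:
            i += 1
    return pairs


def flatten_report(text):
    """Return (header, data_row) for one report in the assumed layout."""
    rows = [row for row in csv.reader(io.StringIO(text))
            if any(cell.strip() for cell in row)]
    # Assumption: the first non-blank row holds only the status text.
    status = next(cell.strip() for cell in rows[0] if cell.strip())
    # Assumption: the next two rows are fixed titles and carry no data.
    per_row = [pairs_in_row(row) for row in rows[3:]]
    header, data = ["Status"], [status]
    # Take the first pair of every row, then the second pair of every row, etc.
    for column in range(max((len(p) for p in per_row), default=0)):
        for pairs in per_row:
            if column < len(pairs):
                key, value = pairs[column]
                header.append(key)
                data.append(value)
    return header, data


header, data = flatten_report(SAMPLE)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerow(data)
print(out.getvalue(), end="")
```

To process real files, the same function could be fed the contents of each file, with the csv.writer opened on the output file instead of an in-memory buffer.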