2012-12-11 55 views
0

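Using Python to extract and sort data from a file

I am trying to extract data from a large CSV file in the format below, where 'x' stands for data that is either text or an integer. Each group has a unique ID, but the groups do not always have the same number of rows or the same colors. Each color is separated from its data by a comma.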

id, x 
red, x 
green, x 
blue, x 
black, x 

id, x 
yellow, x 
green, 
blue, x 
black, x 

id, x 
red, x 
green, x 
blue, x 
black, x 

id, x 
red, x 
green, x 
blue, x 

id, x 
red, x 
green, x 
blue, x 
black, x 

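I want to rearrange the data into a column format. The ID should be the first column, and all of the data should be comma-separated. My goal is for the script to read the first word of each line and place that line's value in the appropriate column.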

line 0 - ID - red - green - blue - yellow - black 
line 1 - x, x, x, , x, 
line 2 - , x, x, x, x, 
line 3 - x, x, x, , x, 
line 4 - x, x, x, , , 
line 5 - x, x, x, , x, 

Here is my attempt...

readfile = open("db-short.txt", "r") 
datafilelines = readfile.readlines() 

writefile = open("sample.csv", "w") 

temp_data_list = ["",]*7 
td_index = 0 

for line_with_return in datafilelines: 
    line = line_with_return.replace('\n','') 
    if not line == '': 
     if not (line.startswith("ID") or 
       line.startswith("RED") or 
       line.startswith("GREEN") or 
       line.startswith("BLUE") or 
       line.startswith("YELLOW") or 
       line.startswith("BLACK")): 
      temp_data_list[td_index] = line 
      td_index += 1 

      temp_data_list[6] = line 
     if (line.startswith("BLACK") or line.startswith("BLACK")): 
      temp_data_list[5] = line 
     if (line.startswith("YELLOW") or line.startswith("YELLOW")): 
      temp_data_list[4] = line 
     if (line.startswith("BLUE") or line.startswith("BLUE")): 
      temp_data_list[3] = line 
     if (line.startswith("GREEN") or line.startswith("GREEN")): 
      temp_data_list[2] = line 
     if (line.startswith("RED") or line.startswith("RED")): 
      temp_data_list[1] = line 
     if (line.startswith("ID") or line.find("ID") > 0): 
      temp_data_list[0] = line 
    if line == '': 
     temp_data_str = "" 
     for temp_data in temp_data_list: 
      temp_data_str += temp_data + "," 
     temp_data_str = temp_data_str[0:-1] + "\n" 
     writefile.write(temp_data_str) 

     temp_data_list = ["",]*7 
     td_index = 0 

if temp_data_list[0]: 
    temp_data_str = "" 
    for temp_data in temp_data_list: 
     temp_data_str += temp_data + "," 
    temp_data_str = temp_data_str[0:-1] + "\n" 
    writefile.write(temp_data_str) 
readfile.close() 
writefile.close() 
+1

What have you tried so far? The standard library 'csv' module might be a good place to start. –

+0

I know you said you want a Python solution, but have you considered R? It is designed for these kinds of tasks. – Stedy

+0

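I'll confess I'm new to programming. I tried using this... http://ubuntuforums.org/showpost.php?p=6159649&postcount=4 but I kept getting this error: IndexError: list assignment index out of range. I now understand that this is because of how the data is formatted. I'll take a look at R. –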

Answers

1

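This assumes Python < 2.7, so it does not take advantage of opening multiple files with a single `with`, the built-in `writeheader`, etc. Note that to get it to work properly, I removed the spaces between the commas in the CSV. As @JamesHenstridge mentioned, it is definitely worth reading up on the csv module so that this makes more sense.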

import csv

with open('testfile', 'rb') as f:
    with open('outcsv.csv', 'wb') as o:
        # Specify your field names
        fieldnames = ('id', 'red', 'green', 'blue', 'yellow', 'black')

        # Here we create a DictWriter, since your data is suited for one
        writer = csv.DictWriter(o, fieldnames=fieldnames)

        # Write the header row
        writer.writerow(dict((h, h) for h in fieldnames))

        # General idea here is to build a row until we hit a blank line,
        # at which point we write our current row and continue
        new_row = {}
        for line in f.readlines():
            # This will split the line on a comma/space combo and then
            # strip off any commas/spaces that end a word
            row = [x.strip(', ') for x in line.strip().split(', ')]
            if not row[0]:
                writer.writerow(new_row)
                new_row = {}
            else:
                # Here we write a blank string if there is no corresponding value;
                # otherwise, write the value
                new_row[row[0]] = '' if len(row) == 1 else row[1].strip()

        # Check new_row - if not blank, it hasn't been written (so write)
        if new_row:
            writer.writerow(new_row)

Using your data above (with some random comma-separated numbers thrown in), this writes:

id,red,green,blue,yellow,black 
x,"2,8","2,4",x,,x 
x,,,"4,3",x,x 
x,x,x,x,,x 
x,x,x,x,, 
x,x,x,x,,x 
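
For anyone on Python 2.7+ or 3.x, here is a rough sketch of the same idea using `writeheader`. It assumes Python 3 and the same six field names; the function name `regroup`, the constant `FIELDNAMES`, and the in-memory sample are made up for illustration and are not part of the code above.

```python
import csv
import io

# Assumed column order, taken from the desired header row
FIELDNAMES = ("id", "red", "green", "blue", "yellow", "black")


def regroup(lines, out):
    """Write one CSV row per blank-line-separated group found in `lines`."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()  # replaces the manual header row
    row = {}
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: the current group is complete
            if row:
                writer.writerow(row)
                row = {}
            continue
        # Split on the first comma/space combo only
        key, _, value = line.partition(", ")
        row[key.strip(", ")] = value.strip()
    # Write the last group if the input does not end with a blank line
    if row:
        writer.writerow(row)


# Small in-memory demo (hypothetical data)
sample = io.StringIO("id, 1\nred, a\ngreen, \nblue, c\n\nid, 2\nyellow, y\nblack, k\n")
out = io.StringIO()
regroup(sample, out)
print(out.getvalue())
```

With real files, the two handles could be opened in one statement, e.g. `with open('testfile', newline='') as f, open('outcsv.csv', 'w', newline='') as o:` followed by `regroup(f, o)`. Missing colors come out as empty fields because `DictWriter` fills absent keys with its default `restval` of `''`.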
+0

Aren't you missing the 'if' statement at the start of the 'for' loop? –

+0

@JamesHenstridge Ha, yes, not sure how that didn't get pasted. Will update later, thanks for pointing it out. – RocketDonkey

+0

There are random spaces between the text and the commas. Is there a way to detect those spaces and remove them? –
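
One possible way to handle the stray spaces, sketched under the assumption that each line holds a single key followed by an optional value: split on the first comma only and call `strip()` on both halves. The helper name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split 'key , value' on the first comma and trim whitespace on both parts."""
    key, _, value = line.partition(",")
    return key.strip(), value.strip()


# Spaces on either side of the comma no longer matter
print(parse_line("red  ,   x "))
print(parse_line("green,"))
print(parse_line("blue , 4,3\n"))
```

Because only the first comma is used as the separator, a value that itself contains commas (such as `4,3`) is kept whole, and a line with no value yields an empty string.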
