Using Python to extract and sort data from a file

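I am trying to extract data from a large CSV file in the format below, where 'x' stands for data that is either text or an integer. Each group has a unique ID, but the groups do not always have the same number of rows or the same colors. The data is separated from the color by a comma.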

id, x 
red, x 
green, x 
blue, x 
black, x 

id, x 
yellow, x 
green, 
blue, x 
black, x 

id, x 
red, x 
green, x 
blue, x 
black, x 

id, x 
red, x 
green, x 
blue, x 

id, x 
red, x 
green, x 
blue, x 
black, x

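I want to rearrange the data into a column format. The ID should be the first column, with all the data separated by commas. My goal is to have it read the first word on each line and put the value in the appropriate column.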

line 0 - ID - red - green - blue - yellow - black 
line 1 - x, x, x, , x, 
line 2 - , x, x, x, x, 
line 3 - x, x, x, , x, 
line 4 - x, x, x, , , 
line 5 - x, x, x, , x,

Here is my attempt so far...

readfile = open("db-short.txt", "r")
datafilelines = readfile.readlines()

writefile = open("sample.csv", "w")

temp_data_list = ["",]*7
td_index = 0

for line_with_return in datafilelines:
    line = line_with_return.replace('\n','')
    if not line == '':
        if not (line.startswith("ID") or
                line.startswith("RED") or
                line.startswith("GREEN") or
                line.startswith("BLUE") or
                line.startswith("YELLOW") or
                line.startswith("BLACK")):
            temp_data_list[td_index] = line
            td_index += 1

            temp_data_list[6] = line
        if (line.startswith("BLACK") or line.startswith("BLACK")):
            temp_data_list[5] = line
        if (line.startswith("YELLOW") or line.startswith("YELLOW")):
            temp_data_list[4] = line
        if (line.startswith("BLUE") or line.startswith("BLUE")):
            temp_data_list[3] = line
        if (line.startswith("GREEN") or line.startswith("GREEN")):
            temp_data_list[2] = line
        if (line.startswith("RED") or line.startswith("RED")):
            temp_data_list[1] = line
        if (line.startswith("ID") or line.find("ID") > 0):
            temp_data_list[0] = line
    if line == '':
        temp_data_str = ""
        for temp_data in temp_data_list:
            temp_data_str += temp_data + ","
        temp_data_str = temp_data_str[0:-1] + "\n"
        writefile.write(temp_data_str)

        temp_data_list = ["",]*7
        td_index = 0

if temp_data_list[0]:
    temp_data_str = ""
    for temp_data in temp_data_list:
        temp_data_str += temp_data + ","
    temp_data_str = temp_data_str[0:-1] + "\n"
    writefile.write(temp_data_str)
readfile.close()
writefile.close()

Source

2012-12-11 the dave

What have you tried so far? The standard library 'csv' module might be a good place to start. –

I know you said you want a Python solution, but have you considered R? It is designed for exactly these kinds of tasks. – Stedy

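I'll confess I'm new to programming. I tried to use this... http://ubuntuforums.org/showpost.php?p=6159649&postcount=4 but I keep getting this error: IndexError: list assignment index out of range. I now understand that this is because of how the data is formatted. I'll take a look at R. –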
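
For reference, the IndexError quoted in the comment above is what Python raises when a fixed-length list is assigned past its last slot. The snippet below is only a hypothetical reproduction: the names `slots` and `lines` are made up, and it is an assumption (not something confirmed in the thread) that the real file contained a group with more lines than the list had slots.

```python
# Hypothetical reproduction: a 7-slot list filled by a running index.
slots = [""] * 7
lines = ["line %d" % i for i in range(8)]  # one more line than there are slots

index = 0
error = None
try:
    for line in lines:
        slots[index] = line  # raises once index reaches 7
        index += 1
except IndexError as exc:
    error = str(exc)

print(error)  # list assignment index out of range
```

Keying the values by their label (for example in a dict) rather than by position sidesteps this, because the number of lines in a group no longer has to match a preallocated size.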

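This assumes Python < 2.7 (so it does not take advantage of opening multiple files with a single with statement, writing the header with the built-in writeheader, etc.). Note that to get it to work properly, I removed the spaces between the commas in your CSV. As @JamesHenstridge mentioned, it would definitely be worth reading up on the csv module so that this makes more sense.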

import csv

with open('testfile', 'rb') as f:
    with open('outcsv.csv', 'wb') as o:
        # Specify your field names
        fieldnames = ('id', 'red', 'green', 'blue', 'yellow', 'black')

        # Here we create a DictWriter, since your data is suited for one
        writer = csv.DictWriter(o, fieldnames=fieldnames)

        # Write the header row
        writer.writerow(dict((h, h) for h in fieldnames))

        # General idea here is to build a row until we hit a blank line,
        # at which point we write our current row and continue
        new_row = {}
        for line in f.readlines():
            # This will split the line on a comma/space combo and then
            # strip off any commas/spaces that end a word
            row = [x.strip(', ') for x in line.strip().split(', ')]
            if not row[0]:
                writer.writerow(new_row)
                new_row = {}
            else:
                # Here we write a blank string if there is no corresponding value;
                # otherwise, write the value
                new_row[row[0]] = '' if len(row) == 1 else row[1].strip()

        # Check new_row - if not blank, it hasn't been written (so write)
        if new_row:
            writer.writerow(new_row)

Using your data above (with a few random comma-separated numbers thrown in), this writes:

id,red,green,blue,yellow,black 
x,"2,8","2,4",x,,x 
x,,,"4,3",x,x 
x,x,x,x,,x 
x,x,x,x,, 
x,x,x,x,,x
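
For anyone on Python 3, the same idea might look roughly like the sketch below. It is not part of the original answer: the `regroup` function name and its path arguments are made up for illustration, the files are opened in text mode with `newline=''` as the csv documentation recommends, and `writeheader()` replaces the hand-built header row. The extra `if new_row:` check on blank lines is also an addition, so that consecutive blank lines do not produce empty rows.

```python
import csv

FIELDNAMES = ("id", "red", "green", "blue", "yellow", "black")


def regroup(in_path, out_path):
    """Write one CSV row per blank-line-separated group of 'label, value' lines."""
    # A single with statement can open both files on Python 2.7+ and 3.
    with open(in_path, newline="") as f, open(out_path, "w", newline="") as o:
        writer = csv.DictWriter(o, fieldnames=FIELDNAMES)
        writer.writeheader()

        new_row = {}
        for line in f:
            # Same split as the answer's code: break on ', ' and trim leftovers
            row = [x.strip(", ") for x in line.strip().split(", ")]
            if not row[0]:
                # Blank line: the current group is complete
                if new_row:
                    writer.writerow(new_row)
                new_row = {}
            else:
                new_row[row[0]] = "" if len(row) == 1 else row[1].strip()

        # Write the last group if the file does not end with a blank line
        if new_row:
            writer.writerow(new_row)
```

It would be called as regroup('testfile', 'outcsv.csv'). Labels missing from a group come out as empty fields, since DictWriter fills absent keys with an empty string by default; a label that is not in FIELDNAMES raises ValueError.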

Source

2012-12-11 03:23:02 RocketDonkey

Are you missing an 'if' statement at the start of the 'for' loop? –

@JamesHenstridge Ha, yes, not sure how that didn't get pasted. Will update later, thanks for pointing it out. – RocketDonkey

There are random spaces between the text and the commas. Is there a way to have it detect the spaces and remove them? –
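
One possible way to cope with stray spaces, sketched here as a suggestion rather than something from the thread, is to split each line at its first comma and strip whitespace from both halves. The helper name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split 'label , value' at the first comma and trim both halves."""
    label, _, value = line.partition(",")
    return label.strip(), value.strip()


print(parse_line("red ,   x"))      # ('red', 'x')
print(parse_line("  green,"))       # ('green', '')
print(parse_line("blue , 4,3 \n"))  # ('blue', '4,3')
print(parse_line("\n"))             # ('', '')
```

In the answer's loop, the label would take the place of row[0] and the value the place of row[1]; an empty label still signals a blank line. Because only the first comma is used as the separator, values that themselves contain commas are kept whole.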
