2013-07-23 84 views
1

Combining multiple CSV files into a single one

I have CSV files whose data is formatted as follows:

file1.csv

ID,NAME 
001,Jhon 
002,Doe 

file2.csv

ID,SCHOOLS_ATTENDED 
001,my Nice School 
002,His lovely school 

file3.csv

ID,SALARY 
001,25 
002,40 

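The ID field is a kind of primary key that will be used to fetch the records.

What is the most efficient way to read 3 to 4 files, fetch the corresponding data, and store it in another CSV file with the header (ID,NAME,SCHOOLS_ATTENDED,SALARY)?

The files are a few hundred MB each (100 to 200 MB).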

+0

Why would anyone downvote this??? – Volatil3

+0

Maybe because it shows a lack of research effort? It wasn't me, though. –

+0

I think this is a duplicate question. You should always search before opening a new one. By the way, it wasn't me! http://stackoverflow.com/questions/17586573/python-combing-data-from-different-csv-files-into-one/17588521#17588521 –

Answers

3

A few hundred megabytes is not that much. Why not go for a simple approach using the csv module and collections.defaultdict:

import csv
from collections import defaultdict

result = defaultdict(dict)
fieldnames = {"ID"}

for csvfile in ("file1.csv", "file2.csv", "file3.csv"):
    with open(csvfile, newline="") as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key)  # wasteful, but I don't care enough
                result[id][key] = row[key]

The resulting defaultdict looks like this:

>>> result 
defaultdict(<type 'dict'>, 
{'001': {'SALARY': '25', 'SCHOOLS_ATTENDED': 'my Nice School', 'NAME': 'Jhon'}, 
'002': {'SALARY': '40', 'SCHOOLS_ATTENDED': 'His lovely school', 'NAME': 'Doe'}}) 

You can then merge this into a single CSV file (not my prettiest work, but good enough):

with open("out.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item])

out.csv then contains

ID,NAME,SALARY,SCHOOLS_ATTENDED 
001,Jhon,25,my Nice School 
002,Doe,40,His lovely school 
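
The header above comes out sorted alphabetically, while the question asks for ID,NAME,SCHOOLS_ATTENDED,SALARY. Here is a minimal sketch of the same defaultdict approach that keeps the columns in first-seen order instead. The function name `merge_csv_by_id` and its parameters are hypothetical, and it assumes every file has the key column, that rows are well formed, and that everything fits in memory:

```python
import csv
from collections import defaultdict


def merge_csv_by_id(paths, out_path, key="ID", delimiter=","):
    """Join several CSV files on a shared key column (hypothetical helper)."""
    merged = defaultdict(dict)   # key value -> {column: value}
    fieldnames = [key]           # a list keeps columns in first-seen order
    for path in paths:
        with open(path, newline="") as infile:
            reader = csv.DictReader(infile, delimiter=delimiter)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                merged[row[key]].update(row)
    with open(out_path, "w", newline="") as outfile:
        # restval="" fills in columns that are missing for a given ID
        writer = csv.DictWriter(outfile, fieldnames, restval="", delimiter=delimiter)
        writer.writeheader()
        writer.writerows(merged.values())
    return fieldnames
```

Called as `merge_csv_by_id(["file1.csv", "file2.csv", "file3.csv"], "out.csv")`, this should write the columns in the order the files introduce them, provided the headers contain no stray whitespace.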
+0

Thanks, but your code gives the error **csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)** – Volatil3

+1

@Volatil3: I only just noticed that you're on Python 3; I've edited the program accordingly. Please try again. –

+0

I just noticed that the delimiter is **~** – Volatil3
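
If the files are separated by `~` rather than a comma, as the last comment says, the csv module's readers accept a `delimiter` argument. A small sketch, using made-up in-memory data in place of the real files:

```python
import csv
import io

# Made-up sample standing in for a "~"-delimited file
sample = io.StringIO("ID~NAME\n001~Jhon\n002~Doe\n")

reader = csv.DictReader(sample, delimiter="~")
rows = list(reader)
print(rows[0]["NAME"])
```

The same keyword can be passed to csv.DictWriter if the output file should use `~` as well.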

0

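Below is working code that combines multiple csv files with a specific keyword in their names into one final csv file. I have set the default keyword to "file", but you can set it to blank if you want to combine all csv files from folder_path. This code takes the header from your first csv file and uses it as the header of the final combined csv file. It ignores the headers of all the other csv files.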

import csv
import glob


def Combine_multiple_csv_files_thatContainsKeywordInTheirNames_into_one_csv_file(folder_path, keyword='file'):
    # Takes the header only from the 1st csv; all other csv headers are
    # skipped and their data is appended to the final csv.
    # folder_path must end with a path separator.

    fileNames = glob.glob(folder_path + "*" + keyword + "*" + ".csv")  # fileNames includes folder_path too
    # newline='' keeps the csv writer's own line endings from being translated
    # again, which would otherwise show up as blank lines between rows.
    with open(folder_path + "Combined_csv.csv", "w", newline='') as fout:
        print('Combining multiple csv files into 1')
        csv_write_file = csv.writer(fout, delimiter=',')
        with open(fileNames[0], mode='rt', newline='') as read_file:  # utf8
            csv_read_file = csv.reader(read_file, delimiter=',')  # yields one list per row
            csv_write_file.writerows(csv_read_file)

        for num in range(1, len(fileNames)):
            with open(fileNames[num], mode='rt', newline='') as read_file:  # utf8
                csv_read_file = csv.reader(read_file, delimiter=',')
                next(csv_read_file)  # ignore header
                csv_write_file.writerows(csv_read_file)
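
The function above expects folder_path to end with a path separator, and on a second run with an empty keyword its glob pattern would also match Combined_csv.csv itself. A hedged sketch of a variant that addresses both points follows. The name `combine_csv_files`, the `out_name` parameter and the sorted file order are my own assumptions, not part of the answer:

```python
import csv
import glob
import os


def combine_csv_files(folder_path, keyword="file", out_name="Combined_csv.csv"):
    """Concatenate matching CSV files, keeping only the first header (hypothetical helper)."""
    out_path = os.path.join(folder_path, out_name)
    pattern = os.path.join(folder_path, "*" + keyword + "*.csv")
    # Sort for a predictable order and leave out a previous output file
    file_names = sorted(
        p for p in glob.glob(pattern)
        if os.path.abspath(p) != os.path.abspath(out_path)
    )
    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for index, name in enumerate(file_names):
            with open(name, newline="") as fin:
                reader = csv.reader(fin)
                if index > 0:
                    next(reader, None)  # skip the header of every file but the first
                writer.writerows(reader)
    return file_names
```

Note that this simply stacks rows, so it suits files that share the same columns; joining on ID, as the question asks, needs the dictionary-based approach from the first answer.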