Splitting a csv file based on a specific column using Python

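I'm a Python beginner and have written a few basic scripts. My latest challenge is to take a very large csv file (10 GB+) and split it into multiple smaller files based on the value of a particular variable in each row.

For example, the file might look like this: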

Category,Title,Sales 
"Books","Harry Potter",1441556 
"Books","Lord of the Rings",14251154 
"Series", "Breaking Bad",6246234 
"Books","The Alchemist",12562166 
"Movie","Inception",1573437

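And I would like to split it into separate files: Books.csv, Series.csv, Movie.csv

In reality there will be hundreds of categories, and they will not be sorted. In this case they are in the first column, but in the future they may not be.

I have found a few solutions online, but none in Python. There is a very simple AWK command that does this in one line, but I don't have access to AWK at work.

I wrote the code below, which works, but I think it is probably very inefficient. Can anyone suggest how to speed it up?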

import csv

# Categories whose output file has already been given a header row
filelist = set()

# Open the large csv file in "read" mode
with open('//directory/largefile', 'r', newline='') as csvfile:

    # Read the first row of the large file and store the whole row as a string (headerstring)
    read_rows = csv.reader(csvfile)
    headerrow = next(read_rows)
    headerstring = ','.join(headerrow)

    # next() already consumed the header, so every remaining row is data
    for row in read_rows:

        # Store the whole row as a string (rowstring); note this drops the original quoting
        rowstring = ','.join(row)

        # The filename is the first entry in the row. This could be made dynamic
        # so that the user inputs a column name to use
        filename = row[0]

        # Reopen the category's file in append mode for every single row
        with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
            # First time this category is seen: write the header row first
            if filename not in filelist:
                filelist.add(filename)
                f.write(headerstring)
                f.write("\n")
            # Always append the current row; the with block closes the file
            f.write(rowstring)
            f.write("\n")

Thanks!

2017-10-20 Actuary

Why not use `pandas`? – Dadep

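A memory-efficient way to do this, which avoids reopening the files for every append (as long as you are not going to end up with a huge number of open file handles), is to use a dict that maps each category to a file object. If the file is not open yet, create it and write the header; then always write the row to the corresponding file, for example: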

import csv

# newline='' is what the csv module recommends for files it reads or writes
with open('somefile.csv', newline='') as fin:
    csvin = csv.DictReader(fin)
    # Category -> (open file, DictWriter) lookup
    outputs = {}
    for row in csvin:
        cat = row['Category']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('{}.csv'.format(cat), 'w', newline='')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)
    # Close all the files
    for fout, _ in outputs.values():
        fout.close()
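
The question mentions that the category may not always be in the first column. One way to handle that is to wrap the same dict-of-open-files idea in a function that takes the column name. This is only a sketch: `split_csv`, `column` and `out_dir` are names made up for illustration, and it assumes the category values are safe to use as file names.

```python
import csv
import os


def split_csv(path, column, out_dir):
    """Write each row of `path` to <out_dir>/<row[column]>.csv, header included."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = {}  # column value -> (file object, DictWriter)
    try:
        with open(path, newline='') as fin:
            reader = csv.DictReader(fin)
            for row in reader:
                key = row[column]
                if key not in outputs:
                    # First row for this value: create the file and write the header
                    fout = open(os.path.join(out_dir, key + '.csv'), 'w', newline='')
                    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    outputs[key] = fout, writer
                outputs[key][1].writerow(row)
    finally:
        # Close every output file, even if reading fails part-way through
        for fout, _ in outputs.values():
            fout.close()
    return sorted(outputs)
```

Called as `split_csv('somefile.csv', 'Category', 'subfiles')`, it returns the sorted list of values it created files for. Because rows are looked up by header name, moving the category to another column needs no code change.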

2017-10-20 11:35:20

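Thanks. Before I saw your solution I managed to come up with something (see the original post; I have corrected my code so that it works). Is your way of checking whether a category is new more efficient than mine? – Actuary

@Actuary The check itself is not necessarily faster, but not opening/closing/reopening the files will cut a lot of IO overhead –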
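
If there turn out to be more categories than the operating system allows open files, a possible middle ground is to keep only a bounded number of handles open and to reopen a file in append mode when its category shows up again after being closed. A rough sketch of that idea follows; `HandleCache`, `split_bounded` and `max_open` are hypothetical names, and the default limit of 100 is arbitrary.

```python
import csv
import os
from collections import OrderedDict


class HandleCache:
    """Keep at most `max_open` output files open, closing the least recently used."""

    def __init__(self, out_dir, fieldnames, max_open=100):
        self.out_dir = out_dir
        self.fieldnames = fieldnames
        self.max_open = max_open
        self.open_files = OrderedDict()  # key -> (file object, DictWriter)
        self.started = set()             # keys whose header is already written

    def writer_for(self, key):
        if key in self.open_files:
            self.open_files.move_to_end(key)  # mark as most recently used
            return self.open_files[key][1]
        if len(self.open_files) >= self.max_open:
            # Evict and close the least recently used handle
            _, (old_file, _) = self.open_files.popitem(last=False)
            old_file.close()
        # 'w' the first time (truncate, then header), 'a' when reopening later
        mode = 'a' if key in self.started else 'w'
        fout = open(os.path.join(self.out_dir, key + '.csv'), mode, newline='')
        writer = csv.DictWriter(fout, fieldnames=self.fieldnames)
        if key not in self.started:
            writer.writeheader()
            self.started.add(key)
        self.open_files[key] = fout, writer
        return writer

    def close(self):
        for fout, _ in self.open_files.values():
            fout.close()
        self.open_files.clear()


def split_bounded(path, column, out_dir, max_open=100):
    """Split `path` by `column`, never holding more than `max_open` files open."""
    os.makedirs(out_dir, exist_ok=True)
    with open(path, newline='') as fin:
        reader = csv.DictReader(fin)
        cache = HandleCache(out_dir, reader.fieldnames, max_open)
        try:
            for row in reader:
                cache.writer_for(row[column]).writerow(row)
        finally:
            cache.close()
```

With unsorted input this still reopens files whenever a category has been evicted, so it only pays off when `max_open` is large enough to cover the categories that occur most often.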
