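Splitting a CSV file based on a specific column using Python
I'm a Python beginner and have written a few basic scripts. My latest challenge is to take a very large CSV file (10 GB+) and split it into multiple smaller files based on the value of a particular variable in each row.
For example, the file might look like this: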
Category,Title,Sales
"Books","Harry Potter",1441556
"Books","Lord of the Rings",14251154
"Series", "Breaking Bad",6246234
"Books","The Alchemist",12562166
"Movie","Inception",1573437
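I would like to split it into separate files: Books.csv, Series.csv, Movie.csv
In reality there will be hundreds of categories, and they will not be sorted. In this case the category is in the first column, but in the future it might not be.
I have found some solutions online, but none in Python. There is a very simple AWK command that does this in one line, but I don't have access to AWK at work.
I wrote the code below. It works, but I suspect it is very inefficient. Can anyone suggest how to speed it up?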
import csv

# Creates an empty set - this will be used to store the values that have already been used
filelist = set()

# Opens the large csv file in "read" mode
with open('//directory/largefile', 'r') as csvfile:
    # Read the first row of the large file and store the whole row as a string (headerstring)
    read_rows = csv.reader(csvfile)
    headerrow = next(read_rows)
    headerstring = ','.join(headerrow)
    for row in read_rows:
        # Store the whole row as a string (rowstring)
        rowstring = ','.join(row)
        # Defines filename as the first entry in the row - This could be made dynamic so that the user inputs a column name to use
        filename = row[0]
        # This basically makes sure it is not looking at the header row.
        if filename != "Category":
            # If the filename is not in the filelist set, add it to the set and create a new csv file with the header row.
            if filename not in filelist:
                filelist.add(filename)
                with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                    f.write(headerstring)
                    f.write("\n")
            # Append the current row to the csv file for its category
            # (this also covers the first row of a newly created file, which was previously skipped).
            with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                f.write(rowstring)
                f.write("\n")
Thanks!
Why not use 'pandas'? – Dadep
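One direction that may help, sketched with only the standard library: most of the time in the code above probably goes into reopening an output file for every row, so the sketch below opens each category's file once and keeps its `csv.writer` in a dictionary. The function name `split_csv_by_column` and its parameters are hypothetical, chosen for illustration. It assumes every category value is safe to use as a file name, and that the number of distinct categories stays below the operating system's limit on simultaneously open files.

```python
import csv
import os
from contextlib import ExitStack


def split_csv_by_column(source_path, output_dir, column="Category"):
    """Write one CSV per distinct value of `column`; return rows written per value."""
    os.makedirs(output_dir, exist_ok=True)
    writers = {}  # column value -> csv.writer for that value's output file
    counts = {}   # column value -> number of data rows written so far

    # ExitStack closes the source file and every output file when the block ends.
    with ExitStack() as stack:
        source = stack.enter_context(open(source_path, newline=""))
        # skipinitialspace tolerates rows such as: "Series", "Breaking Bad",6246234
        reader = csv.reader(source, skipinitialspace=True)
        header = next(reader)
        key_index = header.index(column)  # find the column by name, not by position

        for row in reader:
            if not row:
                continue  # ignore blank lines
            key = row[key_index]
            writer = writers.get(key)
            if writer is None:
                # First time this value is seen: open its file once and write the header.
                path = os.path.join(output_dir, key + ".csv")
                out = stack.enter_context(open(path, "w", newline=""))
                writer = csv.writer(out)
                writer.writerow(header)
                writers[key] = writer
                counts[key] = 0
            writer.writerow(row)
            counts[key] += 1
    return counts
```

With the paths from the question, a call might look like `split_csv_by_column('//directory/largefile', '//directory/subfiles/')`. Writing through `csv.writer` rather than `','.join(row)` also keeps fields that contain commas correctly quoted. Note that the output files are opened in `"w"` mode, so an existing file with the same name is overwritten rather than appended to.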