Python 3.x: concatenate zipped CSV files into one non-zipped CSV file
Here is my Python 3 code:
import zipfile
import os
import time
from timeit import default_timer as timer
import re
import glob
import pandas as pd
# local variables
# pc version
# the_dir = r'c:\ImpExpData'
# linux version
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95'


def main():
    """
    this is the function that controls the processing
    """
    start_time = timer()
    for root, dirs, files in os.walk(the_dir):
        for file in files:
            if file.endswith(".zip"):
                print("working dir is ...", the_dir)
                zipPath = os.path.join(root, file)
                z = zipfile.ZipFile(zipPath, "r")
                for filename in z.namelist():
                    if filename.endswith(".csv"):
                        # print filename
                        if re.match(r'^Trade-Geo.*\.csv$', filename):
                            pass  # do something with geo file
                            # print " Geo data: " , filename
                        elif re.match(r'^Trade-Metadata.*\.csv$', filename):
                            pass  # do something with metadata file
                            # print "Metadata: ", filename
                        else:
                            try:
                                with zipfile.ZipFile(zipPath) as z:
                                    with z.open(filename) as f:
                                        # print("send to test def...", filename)
                                        # print(zipPath)
                                        with zipfile.ZipFile(zipPath) as z:
                                            with z.open(filename) as f:
                                                frame = pd.DataFrame()
                                                # EmptyDataError: No columns to parse from file -- how to deal with this error
                                                train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252")
                                                # train_df = pd.read_csv(f, header=0, skiprows=1, delimiter=",", encoding="cp1252")
                                                list_ = []
                                                list_.append(train_df)
                                                # print(list_)
                                                frame = pd.concat(list_, ignore_index=True)
                                                frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252')  # works
                            except:  # catches EmptyDataError: No columns to parse from file
                                print("EmptyDataError....", filename, "...", zipPath)
    # GetSubDirList(the_dir)
    end_time = timer()
    print("Elapsed time was %g seconds" % (end_time - start_time))


if __name__ == '__main__':
    main()
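It mostly works, except that it does not concatenate all the zipped CSV files into one. There is one empty file; all the CSV files have the same field structure, and they all differ in the number of rows.
Here is what Spyder reports when I run it: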
runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb')
working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95
EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip
/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
execfile(filename, namespace)
Elapsed time was 104.857 seconds
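The final CSV file contains only the last zipped CSV file that was processed; the output file changes in size as each file is processed.
There are 99 CSV files in the zip file that I want to concatenate into one non-zipped CSV file.
The field (column) names are: colmNames = ["hs_code", "uom", "country", "state", "prov", "value", "quatity", "year", "month"]
The CSV files are labelled chp01.csv, chp02.csv, etc. through chp99.csv, and "uom" (unit of measure) is empty, an integer, or a string depending on the hs_code.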
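A column such as "uom" that is sometimes empty, sometimes an integer, and sometimes a string is the usual cause of the DtypeWarning in the Spyder output. A minimal sketch, assuming it is acceptable to treat every field as text at this stage: reading with dtype=str stops pandas from guessing a type per chunk. The helper name read_as_text and the three-column sample are made up for illustration and are not the real data.

```python
import io

import pandas as pd


def read_as_text(handle):
    """Read one CSV with every column kept as a string (hypothetical helper)."""
    # dtype=str disables per-column type inference (no int-vs-str guessing for uom).
    # keep_default_na=False keeps empty cells as "" instead of converting them to NaN.
    return pd.read_csv(handle, dtype=str, keep_default_na=False)


# Tiny made-up sample: uom is empty, then an integer, then a string.
sample = io.StringIO("hs_code,uom,value\n0101,,10\n0102,12,20\n0103,KGM,30\n")
df = read_as_text(sample)
print(df["uom"].tolist())
```

Keeping everything as text also preserves leading zeros in hs_code; numeric columns such as "value" can be converted later, once the files are combined.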
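Question: how do I get the zipped CSV files concatenated into one large (estimated 100 MB uncompressed) CSV file?
Added detail: if I unzip the CSV files, I then have to go and delete them. I need to concatenate the files because I have additional processing to do. Extracting the zipped CSV files is a viable option, but I would prefer not to have to do that.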
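One way to avoid extracting anything to disk is to stream each member straight out of the archive and append it to the output file, writing the header only once. The following is a sketch under stated assumptions, not a tested solution for this data set: the function name concat_zip_csvs and its parameters are hypothetical, the members are assumed to share one header line, and skip_lines is the number of lines that precede that header (the code in the question uses skiprows=1, which would correspond to skip_lines=1).

```python
import zipfile


def concat_zip_csvs(zip_path, out_path, skip_lines=0,
                    skip_prefixes=("Trade-Geo", "Trade-Metadata")):
    """Append every data CSV in zip_path to out_path; return True if anything was written."""
    header_written = False
    with zipfile.ZipFile(zip_path) as archive, open(out_path, "wb") as out:
        for name in archive.namelist():
            # Ignore non-CSV members and the geo/metadata files.
            if not name.endswith(".csv") or name.startswith(skip_prefixes):
                continue
            with archive.open(name) as member:
                for _ in range(skip_lines):      # discard lines before the header
                    member.readline()
                header = member.readline()
                if not header.strip():
                    continue                     # empty member: nothing to copy
                if not header_written:           # keep the header of the first file only
                    out.write(header if header.endswith(b"\n") else header + b"\n")
                    header_written = True
                for line in member:
                    # Make sure a missing final newline does not glue two files together.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
    return header_written
```

Because the bytes are copied unchanged, the output keeps the encoding of the input (cp1252 here), and memory use stays small regardless of how large the combined file becomes. Nothing is validated, so a member with a different column layout would be copied as-is.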
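OK, I got the provided shell script working; if I wanted to do the same thing in Python, how would I do it? Other Stack Overflow items suggest that the route of pandas concat followed by pandas to_csv works, but it does not work for me. Is there something I am missing? – rspaans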
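In the code in the question, list_ is recreated for every member and to_csv is called inside the loop, so each pass overwrites concat_test.csv with a single file; that matches the symptom that only the last CSV survives. A minimal sketch of the concat-then-to_csv route, assuming all members share the same columns: the list is built once, and the output is written once after the loop. The function name concat_with_pandas and its parameters are hypothetical; pass skiprows=1 to mirror the original read_csv call.

```python
import zipfile

import pandas as pd
from pandas.errors import EmptyDataError


def concat_with_pandas(zip_path, out_path, skiprows=0, encoding="cp1252"):
    """Read every data CSV in zip_path and write one combined CSV to out_path."""
    frames = []                                   # created once, outside the loop
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue
            if name.startswith(("Trade-Geo", "Trade-Metadata")):
                continue                          # geo/metadata files are handled elsewhere
            with archive.open(name) as member:
                try:
                    frames.append(pd.read_csv(member, dtype=str,
                                              skiprows=skiprows, encoding=encoding))
                except EmptyDataError:            # only the "empty file" case is skipped
                    print("skipping empty file:", name)
    # pd.concat raises ValueError if frames is empty, i.e. no usable CSV was found.
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out_path, index=False, encoding=encoding)   # written once, at the end
    return combined
```

Catching EmptyDataError specifically, instead of a bare except, means any other failure (a bad encoding, a misspelled name) surfaces as a normal traceback and is not reported as an empty file. This version holds all 99 frames in memory, which should be manageable at roughly 100 MB but is heavier than copying lines directly.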