2017-09-17 22 views

Python 3.x: concatenate zipped CSV files into one uncompressed CSV file

Here is my Python 3 code:

import zipfile 
import os 
import time 
from timeit import default_timer as timer 
import re 
import glob 
import pandas as pd 


# local variabless 
# pc version 
# the_dir = r'c:\ImpExpData' 
# linux version 
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95' 


def main(): 
    """ 
    this is the function that controls the processing 
    """ 
    start_time = timer() 
    for root, dirs, files in os.walk(the_dir): 
     for file in files: 
      if file.endswith(".zip"): 
       print("working dir is ...", the_dir) 
       zipPath = os.path.join(root, file) 
       z = zipfile.ZipFile(zipPath, "r") 
       for filename in z.namelist(): 
        if filename.endswith(".csv"): 
         # print filename 
         if re.match(r'^Trade-Geo.*\.csv$', filename): 
          pass # do somethin with geo file 
         # print " Geo data: " , filename 
         elif re.match(r'^Trade-Metadata.*\.csv$', filename): 
          pass # do something with metadata file 
         # print "Metadata: ", filename 
         else: 
          try: 
           with zipfile.ZipFile(zipPath) as z: 
            with z.open(filename) as f: 
             # print("send to test def...", filename) 
             # print(zipPath) 
             with zipfile.ZipFile(zipPath) as z: 
              with z.open(filename) as f: 
               frame = pd.DataFrame() 
               # EmptyDataError: No columns to parse from file -- how to deal with this error 
               train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252") 
               # train_df = pd.read_csv(f, header=0, skiprows=1, delimiter=",", encoding="cp1252") 
               list_ = [] 
               list_.append(train_df) 
               # print(list_) 
               frame = pd.concat(list_, ignore_index=True) 
               frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252') # works 
          except: # catches EmptyDataError: No columns to parse from file 
           print("EmptyDataError...." ,filename, "...", zipPath) 

# GetSubDirList(the_dir) 
    end_time = timer() 
    print("Elapsed time was %g seconds" % (end_time - start_time)) 


if __name__ == '__main__': 
    main() 

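It mostly works, except that it does not concatenate all the zipped CSV files into one. One of the files is empty, all the CSV files have the same field structure, and the number of rows differs from file to file.

Here is what Spyder reports when I run it: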

runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb') 

working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95 

EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip 

/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False. 
    execfile(filename, namespace) 

Elapsed time was 104.857 seconds 

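The final CSV file contains only the last zipped CSV file that was processed; the size of the output file changes as each file is processed.

There are 99 CSV files in the zip file that I want to concatenate into one uncompressed CSV file.

The field (column) names are: colmNames = ["hs_code", "uom", "country", "state", "prov", "value", "quatity", "year", "month"]

The CSV files are named chp01.csv, chp02.csv, and so on up to chp99.csv. The "uom" (unit of measure) column is empty, an integer, or a string, depending on the hs_code.

Question: how do I concatenate the zipped CSV files into one large (estimated 100 MB uncompressed) CSV file?

Added details: I would rather not extract the CSV files, because then I have to go and delete them. I need to concatenate the files because I have additional processing to do. Extracting the zipped CSV files is a viable option, but I would prefer not to have to do it.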

Answer


Is there any reason you don't want to do this with your shell?

Assuming the order in which the files are concatenated does not matter:

cd "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95" 
unzip "Trade-Exports-Yr1992-1995.zip" -d unzipped && cd unzipped 
for f in Trade-Exports-Chp*.csv; do tail --lines=+2 "$f" >> concat.csv; done 

This strips the first line (the column names) from each CSV file before appending it to concat.csv.

If you just did:

tail --lines=+2 Trade-Exports-Chp*.csv > concat.csv 

you would end up with a file name header before each file's contents:

==> Trade-Exports-Chp-1.csv <== 
... 

==> Trade-Exports-Chp-10.csv <== 
... 

==> Trade-Exports-Chp-2.csv <== 
... 

etc. 

If you care about the order, rename Trade-Exports-Chp-1.csv .. Trade-Exports-Chp-9.csv to Trade-Exports-Chp-01.csv .. Trade-Exports-Chp-09.csv.
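If renaming is inconvenient, the same ordering can be had by sorting on the chapter number instead of the raw file name. This is only a small Python sketch: the helper name `chapter_number` is made up for illustration, and it assumes the chapter number directly follows `Chp-` (or `Chp`) in the name.

```python
import re


def chapter_number(name):
    """Return the integer that follows 'Chp-' (or 'Chp') in a file name, or -1 if none."""
    match = re.search(r"Chp-?(\d+)", name)
    return int(match.group(1)) if match else -1


names = [
    "Trade-Exports-Chp-1.csv",
    "Trade-Exports-Chp-10.csv",
    "Trade-Exports-Chp-2.csv",
]

# Plain sorted(names) would keep Chp-10 before Chp-2; sorting on the number does not.
ordered = sorted(names, key=chapter_number)
print(ordered)
```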

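Although this is doable in Python, I don't think it is the right tool for the job in this case.


If you want to do the work in place, without actually extracting the zip file: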

for i in {1..99}; do 
    unzip -p "Trade-Exports-Yr1992-1995.zip" "Trade-Exports-Chp-$i.csv" | tail --lines=+2 >> concat.csv 
done 
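
For comparison, roughly the same in-place idea can be sketched in Python using only the standard library `zipfile` module. Treat it as a sketch under assumptions, not a drop-in solution: the function name `concat_zipped_csvs` and the `prefix` parameter are made up for illustration, the members are assumed to sit at the top level of the archive, and, unlike the shell loop, it keeps the first line of the first non-empty member as the header of the output.

```python
import zipfile


def concat_zipped_csvs(zip_path, out_path, prefix="Trade-Exports-Chp"):
    """Stream matching CSV members of a zip into one file, keeping a single header line."""
    header_written = False
    with zipfile.ZipFile(zip_path) as archive, open(out_path, "wb") as out:
        for name in archive.namelist():
            # Skip the Geo/Metadata files and anything that is not a chapter CSV.
            if not (name.startswith(prefix) and name.endswith(".csv")):
                continue
            with archive.open(name) as member:
                header = member.readline()
                if not header:
                    continue  # empty member: nothing to copy
                if not header_written:
                    out.write(header if header.endswith(b"\n") else header + b"\n")
                    header_written = True
                for line in member:
                    # Make sure a member without a trailing newline does not
                    # run into the first row of the next member.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
    return out_path
```

Because the bytes are copied unchanged, the original encoding (cp1252 in the question) is preserved and only one line at a time is held in memory.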

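OK, I got the provided shell script working. If I wanted to do the same thing in Python, how would I do it? Other Stack Overflow posts suggest that pandas concat followed by pandas to_csv works, but it does not work for me. Is there something I am missing? – rspaans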
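
On the pandas route: in the code from the question, `list_` is re-created and `to_csv` is called for every member inside the loop, so each write overwrites the previous one and only the last file survives. A minimal sketch of one way around that is to collect the frames in a single list and write once after the loop. The function name `concat_with_pandas` is made up for illustration; the `read_csv` arguments are taken from the question, and `low_memory=False` is the option the DtypeWarning itself suggests.

```python
import zipfile

import pandas as pd
from pandas.errors import EmptyDataError


def concat_with_pandas(zip_path, out_path, encoding="cp1252"):
    """Read every chapter CSV in the zip and write them out as one CSV file."""
    frames = []  # created once, outside the loop
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue
            if name.startswith(("Trade-Geo", "Trade-Metadata")):
                continue  # handled separately in the question's code
            with archive.open(name) as member:
                try:
                    frames.append(
                        pd.read_csv(member, header=0, skiprows=1,
                                    encoding=encoding, low_memory=False)
                    )
                except EmptyDataError:
                    print("skipping empty member:", name)
    # Concatenate and write exactly once, after all members have been read.
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out_path, index=False, encoding=encoding)
    return combined
```

This holds all the data in memory at once, which should be acceptable for roughly 100 MB of CSV but is worth keeping in mind for larger archives.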