Python 3.x: concatenate zipped CSV files into one uncompressed CSV file

Here is my Python 3 code:

import zipfile 
import os 
import time 
from timeit import default_timer as timer 
import re 
import glob 
import pandas as pd 


# local variables
# pc version 
# the_dir = r'c:\ImpExpData' 
# linux version 
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95' 


def main(): 
    """ 
    this is the function that controls the processing 
    """ 
    start_time = timer() 
    for root, dirs, files in os.walk(the_dir):
        for file in files:
            if file.endswith(".zip"):
                print("working dir is ...", the_dir)
                zipPath = os.path.join(root, file)
                z = zipfile.ZipFile(zipPath, "r")
                for filename in z.namelist():
                    if filename.endswith(".csv"):
                        # print filename
                        if re.match(r'^Trade-Geo.*\.csv$', filename):
                            pass  # do something with geo file
                            # print " Geo data: " , filename
                        elif re.match(r'^Trade-Metadata.*\.csv$', filename):
                            pass  # do something with metadata file
                            # print "Metadata: ", filename
                        else:
                            try:
                                with zipfile.ZipFile(zipPath) as z:
                                    with z.open(filename) as f:
                                        # print("send to test def...", filename)
                                        # print(zipPath)
                                        frame = pd.DataFrame()
                                        # EmptyDataError: No columns to parse from file -- how to deal with this error
                                        train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252")
                                        # train_df = pd.read_csv(f, header=0, skiprows=1, delimiter=",", encoding="cp1252")
                                        list_ = []
                                        list_.append(train_df)
                                        # print(list_)
                                        frame = pd.concat(list_, ignore_index=True)
                                        frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252')  # works
                            except:  # catches EmptyDataError: No columns to parse from file
                                print("EmptyDataError....", filename, "...", zipPath)

# GetSubDirList(the_dir) 
    end_time = timer() 
    print("Elapsed time was %g seconds" % (end_time - start_time)) 


if __name__ == '__main__': 
    main()

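It mostly works, except that it does not concatenate all the zipped CSV files into one. One of the files is empty; all the CSV files have the same field structure, but the number of rows differs from file to file.

Here is what Spyder reports when I run it: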

runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb') 

working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95 

EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip 

/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False. 
    execfile(filename, namespace) 

Elapsed time was 104.857 seconds

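The final CSV file contains only the last zipped CSV file that was processed; the output file changes size as each file is processed.

There are 99 CSV files in the zip file that I want to concat into one uncompressed CSV file.

The field (column) names are: colmNames = ["hs_code", "uom", "country", "state", "prov", "value", "quatity", "year", "month"]

The CSV files are labelled chp01.csv, chp02.csv, and so on up to chp99.csv, and "uom" (unit of measure) is empty, an integer, or a string depending on the hs_code.

Question: how do I concatenate the zipped CSV files into one large (estimated 100 MB uncompressed) CSV file?

Added details: I am trying not to unzip the CSV files, because then I have to go and delete them. I need to concatenate the files because I have additional processing to do. Extracting the zipped CSV files is a viable option, but I would rather not have to do that.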


2017-09-17 rspaans

Is there any reason you don't want to do this with your shell?

Assuming the order in which the files are concatenated doesn't matter:

cd "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95" 
unzip "Trade-Exports-Yr1992-1995.zip" -d unzipped && cd unzipped 
for f in Trade-Exports-Chp*.csv; do tail --lines=+2 "$f" >> concat.csv; done

This strips the first line (the column names) from each CSV file before appending it to concat.csv.

If you just did:

tail --lines=+2 Trade-Exports-Chp*.csv > concat.csv

you would end up with:

==> Trade-Exports-Chp-1.csv <== 
... 

==> Trade-Exports-Chp-10.csv <== 
... 

==> Trade-Exports-Chp-2.csv <== 
... 

etc.

If you care about the order, rename Trade-Exports-Chp-1.csv .. Trade-Exports-Chp-9.csv to Trade-Exports-Chp-01.csv .. Trade-Exports-Chp-09.csv.
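In Python the same ordering problem can be sidestepped without renaming anything, by sorting on the chapter number rather than on the raw string. A minimal sketch, assuming the names follow the Trade-Exports-Chp-<n>.csv pattern shown above; chapter_number is a hypothetical helper, not part of any library:

```python
import re


def chapter_number(name):
    """Return the integer chapter from names like 'Trade-Exports-Chp-10.csv'."""
    match = re.search(r"Chp-?(\d+)\.csv$", name)
    # Names that do not match the pattern sort last instead of raising.
    return int(match.group(1)) if match else float("inf")


names = [
    "Trade-Exports-Chp-10.csv",
    "Trade-Exports-Chp-2.csv",
    "Trade-Exports-Chp-1.csv",
]
ordered = sorted(names, key=chapter_number)
print(ordered)
```

A plain sorted(names) would put Chp-10 before Chp-2, which is the same lexical ordering the shell glob produces.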

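Although this is doable in Python, I don't think it is the right tool for the job in this case.

If you want to do the work in place, without actually extracting the zip file: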

for i in {1..99}; do 
    unzip -p "Trade-Exports-Yr1992-1995.zip" "Trade-Exports-Chp-$i.csv" | tail --lines=+2 >> concat.csv 
done


2017-09-17 22:46:13 rjsberry

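OK, I got the provided shell script working; if I wanted to do the same thing in Python, how would I do it? Other Stack Overflow posts suggest that pandas concat followed by pandas to_csv works, but it is not working for me. Is there something I'm missing? – rspaans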
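One likely reason the pandas route fails in the question's code is that list_ is recreated and to_csv is called inside the loop, so every file overwrites the output of the previous one. A minimal sketch of the concat route, assuming all frames fit in memory: collect one DataFrame per member, then concatenate and write once after the loop. read_zipped_csvs is a hypothetical helper, and reading "uom" as str is an assumption meant to avoid the mixed-type DtypeWarning. The question's skiprows=1 is left out because the demo data has no extra leading line.

```python
import io
import zipfile

import pandas as pd


def read_zipped_csvs(zip_file, prefix="Trade-Exports-Chp"):
    """Return one DataFrame built from every non-empty matching CSV member."""
    frames = []  # created once, outside the loop
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if not (name.startswith(prefix) and name.endswith(".csv")):
                continue
            with archive.open(name) as member:
                try:
                    frames.append(
                        pd.read_csv(member, dtype={"uom": str}, encoding="cp1252")
                    )
                except pd.errors.EmptyDataError:
                    print("skipping empty file:", name)
    # pd.concat raises ValueError if no frame was collected at all.
    return pd.concat(frames, ignore_index=True)


# Demo with an in-memory archive that mimics the question's member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Trade-Exports-Chp-1.csv", "hs_code,uom,value\n101,KG,10\n")
    archive.writestr("Trade-Exports-Chp-2.csv", "hs_code,uom,value\n201,12,20\n")
    archive.writestr("Trade-Exports-Chp-77.csv", "")
buffer.seek(0)

frame = read_zipped_csvs(buffer)
csv_text = frame.to_csv(index=False)  # pass a path instead to write a file
print(csv_text)
```

Passing a real zip path to read_zipped_csvs and a file path to to_csv gives a single output file that is written once, after all members have been read.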
