Parsing a CSV file and performing a conversion in Linux

I have large CSV files (several hundred MB each) with many columns:

1;18Jun2013;23:58:58;;;l;o;t;s;;;;o;f;;;;;o;t;h;e;r;;;;;c;o;l;u;m;n;s;;;;;

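As you can see, the second column is a date, which I would like to have in the format %Y-%m-%d so that it is easy to insert into and sort in a database. I believe it is simpler and quicker to convert the raw data than to do the conversion later in the database.

The main script is written in bash. For now I do the conversion as follows: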

sed -n '2,$p' "$TMPF" | while read line; do
    begin=$(echo "$line" | cut -d\; -f1)
    origdate=$(echo "$line" | cut -d\; -f2)
    # cache date translations, a poor man's hash table
    eval origdateh=h$origdate
    if [ "x${!origdateh}" = "x" ]; then
        # not cached yet: call date, then store the result
        datex=$(date -d "$origdate" +%Y-%m-%d)
        eval h$origdate="$datex"
    else
        # cache hit
        datex=$(eval echo \$h$origdate)
    fi
    end=$(echo "$line" | cut -d\; -f3-)
    echo "$begin;$datex;$end" >> "$TMPF2"
done

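I use sed to start at the second line (line 1 contains the CSV header). I believe the subshells spawned for every echo and cut are what slow this down, so the "hash table" is not really of much use...

Can anyone make this faster?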

2013-06-21 Marki

'Can anyone make this faster?': only you, by using a dedicated CSV parser. – anubhava

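Does this have to be done in bash? If you used something with a better array/split/join implementation than the shell (I'm thinking ruby/perl/python), you would probably see a significant speedup. – Joe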
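As a rough illustration of the "dedicated CSV parser" suggestion, here is a minimal sketch using Python's standard csv module. The function name convert_csv and the date_column parameter are made up for this example, and it assumes every row it sees is a data row whose second column looks like 18Jun2013 (the header is not handled here). Note that %b matches month abbreviations in the current locale, so this assumes an English or C locale.

```python
import csv
import datetime
import io

def convert_csv(infile, outfile, date_column=1):
    """Copy ';'-separated rows, rewriting one date column to %Y-%m-%d.

    Hypothetical helper: assumes every row is a data row with a date
    such as '18Jun2013' in date_column.
    """
    reader = csv.reader(infile, delimiter=';')
    writer = csv.writer(outfile, delimiter=';', lineterminator='\n')
    for row in reader:
        parsed = datetime.datetime.strptime(row[date_column], '%d%b%Y')
        row[date_column] = parsed.strftime('%Y-%m-%d')
        writer.writerow(row)

if __name__ == '__main__':
    # Small in-memory demonstration instead of a real file.
    source = io.StringIO('1;18Jun2013;23:58:58;;;x\n')
    target = io.StringIO()
    convert_csv(source, target)
    print(target.getvalue(), end='')
```

With real files, the csv documentation recommends opening them with newline='' before handing them to csv.reader or csv.writer.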

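Don't use a bash script; use a Python script instead. At the very least it will be more readable and maintainable, and probably more efficient.

Example code could look like this (untested):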

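# file: converter.py

import datetime
import sys

def convert_line(line):
    # split line on ';'
    fields = line.split(';')
    # parse the date from the second column, e.g. '18Jun2013'
    # (strptime lives on datetime.datetime; '%b' is the abbreviated month name)
    date = datetime.datetime.strptime(fields[1], '%d%b%Y')
    # replace the item with the date in the desired format
    fields[1] = date.strftime('%Y-%m-%d')
    # return the converted line (the trailing newline is preserved)
    return ';'.join(fields)

# read standard input line by line; the loop ends cleanly at end of file
for line in sys.stdin:
    sys.stdout.write(convert_line(line))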

Now you just run:

cat file.csv | python converter.py > file_converted.csv

Alternative implementation:

# file: converter_2.py

import datetime

def convert_line(line):
    # split line on ';'
    fields = line.split(';')
    # parse the date from the second column, e.g. '18Jun2013'
    # (strptime lives on datetime.datetime; '%b' is the abbreviated month name)
    date = datetime.datetime.strptime(fields[1], '%d%b%Y')
    # replace the item with the date in the desired format
    fields[1] = date.strftime('%Y-%m-%d')
    # return the converted line (the trailing newline is preserved)
    return ';'.join(fields)

# stream the input file into the output file, one converted line at a time
with open('file.csv') as infile, open('file_converted.csv', 'w') as outfile:
    outfile.writelines(convert_line(line) for line in infile)

Example usage:

python converter_2.py

If your CSV has header lines, you of course cannot convert them with this function.
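One way to deal with header lines is to copy them through unchanged and convert only the rows that follow. The sketch below assumes the header occupies the first header_lines lines; convert_file and header_lines are names invented for this illustration, and the date format is assumed to be like 18Jun2013.

```python
import datetime
import io

def convert_line(line):
    # Rewrite the second ';'-separated field from e.g. '18Jun2013' to '2013-06-18'.
    fields = line.split(';')
    parsed = datetime.datetime.strptime(fields[1], '%d%b%Y')
    fields[1] = parsed.strftime('%Y-%m-%d')
    return ';'.join(fields)

def convert_file(infile, outfile, header_lines=1):
    # Hypothetical helper: the first header_lines lines are written as-is,
    # every later line goes through convert_line.
    for index, line in enumerate(infile):
        if index < header_lines:
            outfile.write(line)
        else:
            outfile.write(convert_line(line))

if __name__ == '__main__':
    # In-memory demonstration; real code would pass open file objects.
    source = io.StringIO('id;date;time;rest\n1;18Jun2013;23:58:58;x\n')
    target = io.StringIO()
    convert_file(source, target)
    print(target.getvalue(), end='')
```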

2013-06-21 22:14:26 moooeeeep

The speed is very much to my liking (a 700 MB file in a few minutes). – Marki
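If more speed is ever needed, the date cache from the question's bash script could be carried over to Python. The following is only a sketch under the assumption that the file contains relatively few distinct dates. It memoizes the conversion with functools.lru_cache, and convert_date is a name chosen for this example. Whether it gives a noticeable gain has not been measured here.

```python
import datetime
import functools

@functools.lru_cache(maxsize=None)
def convert_date(text):
    # Parse e.g. '18Jun2013' once; repeated dates are served from the cache.
    return datetime.datetime.strptime(text, '%d%b%Y').strftime('%Y-%m-%d')

def convert_line(line):
    # Only the second ';'-separated field is touched; the rest is kept verbatim.
    fields = line.split(';')
    fields[1] = convert_date(fields[1])
    return ';'.join(fields)

if __name__ == '__main__':
    print(convert_line('1;18Jun2013;23:58:58;;x'))
    print(convert_line('2;18Jun2013;23:59:01;;y'))
    # cache_info() reports how many lookups were answered from the cache.
    print(convert_date.cache_info())
```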

Thanks, I tried the first example, and the following seems to work fine when called from the bash script.

# file: csvconvert.py
import datetime

def convert_line(line):
    # split line on ';'
    fields = line.split(';')
    # parse the date from the second column, e.g. '18Jun2013'
    date = datetime.datetime.strptime(fields[1], '%d%b%Y')
    # replace the item with the date in the desired format
    fields[1] = date.strftime('%Y-%m-%d')
    # return the converted line
    return ';'.join(fields)

while True:
    try:
        # input() is the Python 3 name; on Python 2 use raw_input()
        print(convert_line(input()))
    except EOFError:
        # end of input reached
        break

Use

tail -n +2 FILE | python csvconvert.py > xxx

to skip the header.

2013-06-21 23:08:37 Marki
