2010-12-01 45 views
1

Regex to parse an interesting CSV?

I need to parse a CSV file using AWK. A line in the CSV looks like this:

"hello, world?",1 thousand,"oneword",,,"last one" 

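Some important observations:
- fields inside quoted strings can contain commas and multiple words
- unquoted fields can be multiple words
- fields can be empty, indicated by two consecutive commas

Any clues on writing a regex to split this line properly?

Thanks!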

+7

Do you really need to use awk? Plenty of languages have built-in CSV parsers. – nmichaels 2010-12-01 22:52:44

+0

(?:^|,)("(?:[^"]+|"")*"|[^,]*) – dawg 2010-12-01 22:53:52

Answers

3

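As many have observed, CSV is a harder format than it first appears. There are many edge cases and ambiguities. One ambiguity in your own example: is ',,,' a single field containing a comma, or two empty fields?

Perl, Python, Java, etc. are better equipped to handle CSV because they have well-tested libraries for it. A regex will be more fragile.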
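To illustrate the point about library support, here is a minimal sketch using Python's standard `csv` module on the sample line from the question. The helper name `split_csv_line` is made up for this illustration, and the sketch assumes each record fits on a single line.

```python
import csv

SAMPLE = '"hello, world?",1 thousand,"oneword",,,"last one"'

def split_csv_line(text):
    """Split one CSV record with the standard-library parser."""
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([text]))

if __name__ == "__main__":
    for field in split_csv_line(SAMPLE):
        print(repr(field))
```

With the default dialect, the two consecutive empty positions come back as empty strings, so the sample line yields six fields.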

With AWK, I have had some success with the AWK function below. It works under awk, gawk, and nawk.

#!/usr/bin/awk -f 
#************************************************************************** 
# 
# This file is in the public domain. 
# 
# For more information email [email protected] 
# Or see http://lorance.freeshell.org/csv/ 
# 
# Parse a CSV string into an array. 
# The number of fields found is returned. 
# In the event of an error a negative value is returned and csverr is set to 
# the error. See below for the error values. 
# 
# Parameters: 
# string = The string to parse. 
# csv  = The array to parse the fields into. 
# sep  = The field separator character. Normally , 
# quote = The string quote character. Normally " 
# escape = The quote escape character. Normally " 
# newline = Handle embedded newlines. Provide either a newline or the 
#   string to use in place of a newline. If left empty embedded 
#   newlines cause an error. 
# trim = When true spaces around the separator are removed. 
#   This affects parsing. Without this a space between the 
#   separator and quote result in the quote being ignored. 
# 
# These variables are private: 
# fields = The number of fields found thus far. 
# pos  = Where to pull a field from the string. 
# strtrim = True when a string is found so we know to remove the quotes. 
# 
# Error conditions: 
# -1 = Unable to read the next line. 
# -2 = Missing end quote. 
# -3 = Missing separator. 
# 
# Notes: 
# The code assumes that every field is preceded by a separator, even the 
# first field. This makes the logic much simpler, but also requires a 
# separator be prepended to the string before parsing. 
#************************************************************************** 
function parse_csv(string,csv,sep,quote,escape,newline,trim, fields,pos,strtrim) { 
    # Make sure there is something to parse. 
    if (length(string) == 0) return 0; 
    string = sep string; # The code below assumes ,FIELD. 
    fields = 0; # The number of fields found thus far. 
    while (length(string) > 0) { 
     # Remove spaces after the separator if requested. 
     if (trim && substr(string, 2, 1) == " ") { 
      if (length(string) == 1) return fields; 
      string = substr(string, 2); 
      continue; 
     } 
     strtrim = 0; # Used to trim quotes off strings. 
     # Handle a quoted field. 
     if (substr(string, 2, 1) == quote) { 
      pos = 2; 
      do { 
       pos++ 
       if (pos != length(string) && 
        substr(string, pos, 1) == escape && 
        (substr(string, pos + 1, 1) == quote || 
        substr(string, pos + 1, 1) == escape)) { 
        # Remove escaped quote characters. 
        string = substr(string, 1, pos - 1) substr(string, pos + 1); 
       } else if (substr(string, pos, 1) == quote) { 
        # Found the end of the string. 
        strtrim = 1; 
       } else if (newline && pos >= length(string)) { 
        # Handle embedded newlines if requested. 
        if (getline == -1) { 
         csverr = "Unable to read the next line."; 
         return -1; 
        } 
        string = string newline $0; 
       } 
      } while (pos < length(string) && strtrim == 0) 
      if (strtrim == 0) { 
       csverr = "Missing end quote."; 
       return -2; 
      } 
     } else { 
      # Handle an empty field. 
      if (length(string) == 1 || substr(string, 2, 1) == sep) { 
       csv[fields] = ""; 
       fields++; 
       if (length(string) == 1) 
        return fields; 
       string = substr(string, 2); 
       continue; 
      } 
      # Search for a separator. 
      pos = index(substr(string, 2), sep); 
      # If there is no separator the rest of the string is a field. 
      if (pos == 0) { 
       csv[fields] = substr(string, 2); 
       fields++; 
       return fields; 
      } 
     } 
     # Remove spaces after the separator if requested. 
     if (trim && pos != length(string) && substr(string, pos + strtrim, 1) == " ") { 
      trim = strtrim 
      # Count the number of spaces found.
      while (pos < length(string) && substr(string, pos + trim, 1) == " ") { 
       trim++ 
      } 
      # Remove them from the string. 
      string = substr(string, 1, pos + strtrim - 1) substr(string, pos + trim); 
      # Adjust pos with the trimmed spaces if a quoted string was not found.
      if (!strtrim) { 
       pos -= trim; 
      } 
     } 
     # Make sure we are at the end of the string or there is a separator. 
     if ((pos != length(string) && substr(string, pos + 1, 1) != sep)) { 
      csverr = "Missing separator."; 
      return -3; 
     } 
     # Gather the field. 
     csv[fields] = substr(string, 2 + strtrim, pos - (1 + strtrim * 2)); 
     fields++; 
     # Remove the field from the string for the next pass. 
     string = substr(string, pos + 1); 
    } 
    return fields; 
} 

{ 
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1); 
    if (num_fields < 0) { 
     printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0; 
    } else { 
     printf "%s -> \n", $0; 
     printf "%s fields\n", num_fields; 
     for (i = 0;i < num_fields;i++) { 
      printf "%s\n", csv[i]; 
     } 
     printf "|\n"; 
    } 
} 

Running it on your example data produces:

"hello, world?",1 thousand,"oneword",,,"last one" -> 
6 fields 
hello, world? 
1 thousand 
oneword 


last one 
| 

An example Perl solution:

$ echo '"hello, world?",1 thousand,"oneword",,,"last one"' | 
perl -lnE 'for(/(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g) { s/"$//; s/""/"/g if (s/^"//); 
say}' 
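For readers less familiar with Perl, the same idea can be sketched in Python. This is a rough port rather than the answerer's code: it reuses the pattern from the one-liner, the names `FIELD` and `split_fields` are invented here, and it assumes well-formed input (every opening quote has a matching closing quote).

```python
import re

# Same pattern as the Perl one-liner: after the line start or a comma,
# capture either a quoted string ("" is an escaped quote) or a run of
# non-comma characters (possibly empty).
FIELD = re.compile(r'(?:^|,)("(?:[^"]+|"")*"|[^,]*)')

def split_fields(line):
    fields = []
    for raw in FIELD.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Drop the surrounding quotes, then unescape doubled quotes.
            raw = raw[1:-1].replace('""', '"')
        fields.append(raw)
    return fields

print(split_fields('"hello, world?",1 thousand,"oneword",,,"last one"'))
# ['hello, world?', '1 thousand', 'oneword', '', '', 'last one']
```

The capture group leaves the surrounding quotes in place, which is why both the Perl version and this sketch strip them in a second step.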
0

Try this:

^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$ 

I haven't tested it with AWK, though.
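One thing to keep in mind is that this pattern matches the whole line in a single pass, so the individual fields are not directly available from it. The Python sketch below (the names `LINE` and `first_and_last` are hypothetical) shows that a repeated capture group keeps only its last repetition. Separately, POSIX awk uses extended regular expressions, which have no `(?:...)` non-capturing syntax, so the pattern would probably need plain groups before it could be tried in AWK.

```python
import re

# The pattern from the answer above, unchanged.
LINE = re.compile(r'^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$')

def first_and_last(line):
    """Return (first field, last repeated field), still quoted, or None."""
    m = LINE.match(line)
    if m is None:
        return None
    # Group 2 is the first field. Group 4 sits inside a repeated group,
    # so only its final repetition is kept.
    return m.group(2), m.group(4)

print(first_and_last('"hello, world?",1 thousand,"oneword",,,"last one"'))
# ('"hello, world?"', '"last one"')
```

The four fields in the middle are matched but never exposed, which is why a per-field pattern applied repeatedly is the more practical way to split the line.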