我需要使用AWK解析CSV文件。在CSV一條線看起來是這樣的:正則表達式解析一個有趣的CSV?
"hello, world?",1 thousand,"oneword",,,"last one"
一些重要意見: 已場內引號字符串中可以包含逗號和多個單詞 -unquoted字段可以是多個世界 已場內可以通過只具有空連續兩個逗號
任何關於編寫正則表達式來正確拆分這條線的線索?
謝謝!
我需要使用AWK解析CSV文件。在CSV一條線看起來是這樣的:正則表達式解析一個有趣的CSV?
"hello, world?",1 thousand,"oneword",,,"last one"
一些重要意見: 已場內引號字符串中可以包含逗號和多個單詞 -unquoted字段可以是多個世界 已場內可以通過只具有空連續兩個逗號
任何關於編寫正則表達式來正確拆分這條線的線索?
謝謝!
正如許多人所觀察到的,CSV比第一次出現的格式要難一些。有許多邊緣案例和含糊之處。作爲一個例子,在你的例子中,歧義是',,,',一個逗號或兩個空白字段的字段?
Perl,python,Java等能夠更好地處理CSV,因爲它們具有相同的經過良好測試的庫。正則表達式會更脆弱。
有了AWK,我用THIS AWK函數取得了一些成功。它在AWK,gawk和nawk下工作。
#!/usr/bin/awk -f
#**************************************************************************
#
# This file is in the public domain.
#
# For more information email [email protected]
# Or see http://lorance.freeshell.org/csv/
#
# Parse a CSV string into an array.
# The number of fields found is returned.
# In the event of an error a negative value is returned and csverr is set to
# the error. See below for the error values.
#
# Parameters:
# string = The string to parse.
# csv = The array to parse the fields into.
# sep = The field separator character. Normally ,
# quote = The string quote character. Normally "
# escape = The quote escape character. Normally "
# newline = Handle embedded newlines. Provide either a newline or the
# string to use in place of a newline. If left empty embedded
# newlines cause an error.
# trim = When true spaces around the separator are removed.
# This affects parsing. Without this a space between the
# separator and quote result in the quote being ignored.
#
# These variables are private:
# fields = The number of fields found thus far.
# pos = Where to pull a field from the string.
# strtrim = True when a string is found so we know to remove the quotes.
#
# Error conditions:
# -1 = Unable to read the next line.
# -2 = Missing end quote.
# -3 = Missing separator.
#
# Notes:
# The code assumes that every field is preceded by a separator, even the
# first field. This makes the logic much simpler, but also requires a
# separator be prepended to the string before parsing.
#**************************************************************************
function parse_csv(string,csv,sep,quote,escape,newline,trim, fields,pos,strtrim) {
# Make sure there is something to parse.
if (length(string) == 0) return 0;
string = sep string; # The code below assumes ,FIELD.
fields = 0; # The number of fields found thus far.
while (length(string) > 0) {
# Remove spaces after the separator if requested.
if (trim && substr(string, 2, 1) == " ") {
if (length(string) == 1) return fields;
string = substr(string, 2);
continue;
}
strtrim = 0; # Used to trim quotes off strings.
# Handle a quoted field.
if (substr(string, 2, 1) == quote) {
pos = 2;
do {
pos++
if (pos != length(string) &&
substr(string, pos, 1) == escape &&
(substr(string, pos + 1, 1) == quote ||
substr(string, pos + 1, 1) == escape)) {
# Remove escaped quote characters.
string = substr(string, 1, pos - 1) substr(string, pos + 1);
} else if (substr(string, pos, 1) == quote) {
# Found the end of the string.
strtrim = 1;
} else if (newline && pos >= length(string)) {
# Handle embedded newlines if requested.
if (getline == -1) {
csverr = "Unable to read the next line.";
return -1;
}
string = string newline $0;
}
} while (pos < length(string) && strtrim == 0)
if (strtrim == 0) {
csverr = "Missing end quote.";
return -2;
}
} else {
# Handle an empty field.
if (length(string) == 1 || substr(string, 2, 1) == sep) {
csv[fields] = "";
fields++;
if (length(string) == 1)
return fields;
string = substr(string, 2);
continue;
}
# Search for a separator.
pos = index(substr(string, 2), sep);
# If there is no separator the rest of the string is a field.
if (pos == 0) {
csv[fields] = substr(string, 2);
fields++;
return fields;
}
}
# Remove spaces after the separator if requested.
if (trim && pos != length(string) && substr(string, pos + strtrim, 1) == " ") {
trim = strtrim
# Count the number fo spaces found.
while (pos < length(string) && substr(string, pos + trim, 1) == " ") {
trim++
}
# Remove them from the string.
string = substr(string, 1, pos + strtrim - 1) substr(string, pos + trim);
# Adjust pos with the trimmed spaces if a quotes string was not found.
if (!strtrim) {
pos -= trim;
}
}
# Make sure we are at the end of the string or there is a separator.
if ((pos != length(string) && substr(string, pos + 1, 1) != sep)) {
csverr = "Missing separator.";
return -3;
}
# Gather the field.
csv[fields] = substr(string, 2 + strtrim, pos - (1 + strtrim * 2));
fields++;
# Remove the field from the string for the next pass.
string = substr(string, pos + 1);
}
return fields;
}
{
num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
if (num_fields < 0) {
printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
} else {
printf "%s -> \n", $0;
printf "%s fields\n", num_fields;
for (i = 0;i < num_fields;i++) {
printf "%s\n", csv[i];
}
printf "|\n";
}
}
它運行在您的示例數據產生:
"hello, world?",1 thousand,"oneword",,,"last one" ->
6 fields
hello, world?
1 thousand
oneword
last one
|
一個例子Perl的解決方案:
$ echo '"hello, world?",1 thousand,"oneword",,,"last one"' |
perl -lnE 'for(/(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g) { s/"$//; s/""/"/g if (s/^"//);
say}'
試試這個:
^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
我還沒有測試AWK它雖然。
你真的需要用awk?有很多語言內置CSV解析器。 – nmichaels 2010-12-01 22:52:44
(?:^ |,)(「(?:[^」] + |「」)*「| [^,] *) – dawg 2010-12-01 22:53:52