csv模塊不會同時處理「and」引號的情況,如果沒有提供這種方言的模塊,就必須進入解析業務。爲了避免依賴第三方模塊中,我們可以使用re
模塊進行詞法分析,使用re.MatchObject.lastindex噱頭將令牌類型與匹配的模式相關聯
以下代碼作爲腳本運行時會通過所有顯示的測試,與Python 2.7和2.2
import re
# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)
_pattern_tuples = (
(r'"[^"]*"', DQUOTED),
(r"'[^']*'", SQUOTED),
(r",", COMMA),
(r"$", NEWLINE), # matches end of string OR \n just before end of string
(r"[^,\n]+", UNQUOTED), # order in the above list is important
)
_matcher = re.compile(
'(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
).match
_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1
def csv_split(text):
"""Split a csv string into a list of fields.
Fields may be quoted with " or ' or be unquoted.
An unquoted string can contain both a " and a ', provided neither is at
the start of the string.
A trailing \n will be ignored if present.
"""
fields = []
pos = 0
want_field = True
while 1:
m = _matcher(text, pos)
if not m:
raise ValueError("Problem at offset %d in %r" % (pos, text))
ttype = _toktype[m.lastindex]
if want_field:
if ttype in (DQUOTED, SQUOTED):
fields.append(m.group(0)[1:-1])
want_field = False
elif ttype == UNQUOTED:
fields.append(m.group(0))
want_field = False
elif ttype == COMMA:
fields.append("")
else:
assert ttype == NEWLINE
fields.append("")
break
else:
if ttype == COMMA:
want_field = True
elif ttype == NEWLINE:
break
else:
print "*** Error dump ***", ttype, repr(m.group(0)), fields
raise ValueError("Missing comma at offset %d in %r" % (pos, text))
pos = m.end(0)
return fields
if __name__ == "__main__":
tests = (
("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
("""\n""", ['']),
("""""", ['']),
("""a,b\n""", ['a', 'b']),
("""a,b""", ['a', 'b']),
(""",,,\n""", ['', '', '', '']),
("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
)
for text, expected in tests:
result = csv_split(text)
print
print repr(text)
print repr(result)
print repr(expected)
print result == expected
如果你指定你想在某些情況下發生的,你簡單的例子,不包括什麼這將是非常有益的:(1)`'abcd'efgh (2)`'abcd'「efgh」`(3)`abcd「efgh」` - 你想讓它從每一個(WITH QUOTES UNSTRIPPED)產生一個字段還是一個異常或其他東西? – 2011-02-14 20:40:28
另外,假設你的輸入文件是通過查詢客戶數據庫並用一個不合情理的地址行產生的,如'O'Drien'Road的'Dunromin',那麼在輸入文件中如何引用/轉義? – 2011-02-14 23:57:08