2011-02-13 140 views
1

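I want to split a comma-separated string in Python. The tricky part for me is that some of the fields in the data themselves contain a comma, and those fields are enclosed in quotes (" or '). The resulting split strings should also have the quotes around the fields removed. Also, some fields can be empty. How do I split a comma-separated string in Python, except for the commas that are within quotes?

Example: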

hey,hello,,"hello,world",'hey,world' 

needs to be split into 5 parts, as below:

['hey', 'hello', '', 'hello,world', 'hey,world'] 

Any ideas, thoughts, suggestions, or help on how to solve the above problem in Python would be much appreciated.

Thank you, Vish

+0

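It would be very helpful if you specified what you want to happen in some cases that your simple example does not cover: (1) `'abcd'efgh` (2) `'abcd'"efgh"` (3) `abcd"efgh"` - do you want each of these to produce one field (WITH QUOTES UNSTRIPPED), an exception, or something else? – 2011-02-14 20:40:28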

+0

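Also, suppose your input file was produced by querying a customer database and contains a not-unreasonable address line such as 'Dunromin', O'Drien Road: how would that be quoted/escaped in the input file? – 2011-02-14 23:57:08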

Answers

4

(Edit: the original answer had trouble with empty fields at the edges because of the way re.findall works, so I refactored it a bit and added tests.)

import re 

def parse_fields(text): 
    r""" 
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\'')) 
    ['hey', 'hello', '', 'hello,world', 'hey,world'] 
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\',')) 
    ['hey', 'hello', '', 'hello,world', 'hey,world', ''] 
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\',')) 
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', ''] 
    >>> list(parse_fields('')) 
    [''] 
    >>> list(parse_fields(',')) 
    ['', ''] 
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string')) 
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string'] 
    >>> list(parse_fields('testing,"unterminated quotes')) 
    ['testing', '"unterminated quotes'] 
    """ 
    pos = 0 
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""") 
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)

        yield result

        if not separator:
            break

        pos = m.end(0)

if __name__ == "__main__": 
    import doctest 
    doctest.testmod() 

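(['"]?) matches an optional single or double quote.

(.*?) matches the field itself. This is a non-greedy match, so it matches only as much as necessary instead of eating the whole string. It is assigned to result, and it is what we actually yield.

\1 is a backreference that matches the same single or double quote we matched earlier (if any).

(,|$) matches either the comma separating each entry or the end of the line. It is assigned to separator.

If separator is false (e.g. empty), there is no separator, which means we are at the end of the string and we are done. Otherwise, we move the start position to where the regex finished matching (m.end(0)) and continue the loop.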
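
To make the roles of the three groups concrete, here is a small sketch that records what each group captures at every step of the loop. It reuses the pattern from the answer; the helper name trace_groups and the sample input are made up for illustration.

```python
import re

# Same pattern as in the answer above.
exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

def trace_groups(text):
    """Return (quote, field, separator) for each step of the loop."""
    steps = []
    pos = 0
    while True:
        m = exp.search(text, pos)
        steps.append(m.groups())
        if not m.group(3):      # empty separator: end of string reached
            break
        pos = m.end(0)          # resume right after the comma
    return steps

for quote, field, separator in trace_groups('"a,b",\'c,d\',e'):
    print(repr(quote), repr(field), repr(separator))
```

For the quoted fields, group 1 holds the opening quote and the backreference forces the same character to close the field, which is why the comma inside the quotes is not treated as a separator.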

9

Sounds like you want the csv module.

+1

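-1 Sounds like 10 people (at the time of writing) did not read the fine print: there are two quote characters, e.g. `"hello,world",'hey,world'` - the csv module won't do that. – 2011-02-13 07:01:13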

+1

@John: we may disagree about a lot of things, but I have a feeling we agree that the voting system here sometimes has its, uh, weaknesses... – 2011-02-13 09:35:45
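
The limitation raised in the comments can be checked with a short sketch. It assumes only the standard csv module; the helper name split_with_csv is made up for illustration. A csv reader accepts a single quotechar, so whichever quote character is not selected is treated as ordinary data and its field is split at the comma.

```python
import csv

line = 'hey,hello,,"hello,world",\'hey,world\''

def split_with_csv(line, quotechar):
    """Parse one line with csv.reader using a single quote character."""
    return next(csv.reader([line], quotechar=quotechar))

# With the default double quote, the single-quoted field is split in two.
print(split_with_csv(line, '"'))
# With a single quote, the double-quoted field is split instead.
print(split_with_csv(line, "'"))
```

Either way the result has six fields rather than the five asked for in the question.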

2

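The csv module won't handle the case where both " and ' act as quote characters. Absent a module that provides that kind of dialect, one has to get into the parsing business. To avoid depending on a third-party module, we can use the re module for the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the pattern that matched.

The following code, when run as a script, passes all the tests shown, with Python 2.7 and 2.2: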

import re 

# lexical token symbols 
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = range(5)

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED), 
    (r"'[^']*'", SQUOTED), 
    (r",", COMMA), 
    (r"$", NEWLINE), # matches end of string OR \n just before end of string 
    (r"[^,\n]+", UNQUOTED), # order in the above list is important 
    ) 
_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')', 
    ).match 
_toktype = [None] + [i[1] for i in _pattern_tuples] 
# need dummy at start because re.MatchObject.lastindex counts from 1 

def csv_split(text): 
    """Split a csv string into a list of fields. 
    Fields may be quoted with " or ' or be unquoted. 
    An unquoted string can contain both a " and a ', provided neither is at 
    the start of the string. 
    A trailing \n will be ignored if present. 
    """ 
    fields = [] 
    pos = 0 
    want_field = True 
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                # strip the surrounding quotes
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                # a comma where a field was expected means an empty field
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print("*** Error dump *** %r %r %r" % (ttype, m.group(0), fields))
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields

if __name__ == "__main__": 
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
        )
    for text, expected in tests:
        result = csv_split(text)
        # single-argument print calls behave the same on Python 2 and 3
        print("")
        print(repr(text))
        print(repr(result))
        print(repr(expected))
        print(result == expected)
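
The lastindex trick used in this answer can be seen in isolation in the sketch below. The token names and the tokenize helper are hypothetical and chosen only for illustration. Each alternative of the pattern is wrapped in its own group, so lastindex reports which alternative matched and can be used as an index into a parallel list of token types.

```python
import re

# Hypothetical token names, for illustration only.
# Index 0 is a dummy because lastindex counts groups from 1.
TOKEN_NAMES = [None, "NUMBER", "WORD", "COMMA"]
matcher = re.compile(r"(\d+)|([A-Za-z]+)|(,)").match

def tokenize(text):
    """Return a list of (token_name, matched_text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = matcher(text, pos)
        if not m:
            raise ValueError("No token at offset %d in %r" % (pos, text))
        tokens.append((TOKEN_NAMES[m.lastindex], m.group(0)))
        pos = m.end(0)
    return tokens

print(tokenize("abc,42,x"))
```

This only works as intended when the alternatives contain no nested capturing groups, since lastindex refers to the last group that took part in the match.
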
2

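I cobbled together something like this. It is probably quite redundant, but it does the job for me. You will have to adapt it a bit to your specification: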

import re

def csv_splitter(line):
    # Note: only double quotes are treated as quote characters here.
    split_points = [0]
    fields = []
    outside_quotes = True
    for pos, char in enumerate(line):
        if char == '"':
            # every double quote toggles between inside and outside
            outside_quotes = not outside_quotes
        elif char == "," and outside_quotes:
            split_points.append(pos)
    split_points.append(len(line) + 1)
    for i in range(len(split_points) - 1):
        chunk = line[split_points[i]:split_points[i + 1]]
        # drop the leading comma left by the split and any double quotes
        fields.append(re.sub('^,|"', "", chunk))
    return fields
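
One possible adaptation of this character-by-character approach to the question's two quote characters is sketched below. The function name split_quoted and its quotes parameter are made up for illustration, and the sketch makes simplifying assumptions: there is no escape mechanism, a quote character anywhere in a field opens a quoted section (so an apostrophe in unquoted text would be misread), and an unterminated quote swallows the rest of the line.

```python
def split_quoted(line, quotes="\"'"):
    """Split on commas that are not inside a quoted section."""
    fields = []
    current = []
    open_quote = None  # the quote character we are currently inside, if any
    for char in line:
        if open_quote:
            if char == open_quote:
                open_quote = None       # closing quote: drop it
            else:
                current.append(char)    # commas inside quotes are kept
        elif char in quotes:
            open_quote = char           # opening quote: drop it
        elif char == ",":
            fields.append("".join(current))
            current = []
        else:
            current.append(char)
    fields.append("".join(current))
    return fields

print(split_quoted('hey,hello,,"hello,world",\'hey,world\''))
```

Remembering which quote character opened the section is what lets a field such as '"starts with "...' keep its inner double quotes.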