2010-01-25 276 views
123

In Python, how do I split a string and keep the separators?

Here is the simplest way to explain this. This is what I'm using:

re.split(r'\W', 'foo/bar spam\neggs') 
-> ['foo', 'bar', 'spam', 'eggs'] 

This is what I want:

someMethod('\W', 'foo/bar spam\neggs') 
-> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs'] 

The reason is that I want to split a string into tokens, manipulate them, and then put the string back together again.

+1

What does '\W' stand for? My Google search for it failed. – Ooker 2015-08-29 19:26:39

+2

A _non-word_ character; [see here for details](https://docs.python.org/2/library/re.html#regular-expression-syntax) – Russell 2015-12-02 21:27:03

Answers

168
>>> re.split(r'(\W)', 'foo/bar spam\neggs') 
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs'] 
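
The question's goal was to split into tokens, manipulate them, and join them back. A minimal sketch of that round trip follows, assuming a single capturing group so that separators land at the odd indices of the result; the upper-casing step is only an illustrative manipulation and not part of the original answer.

```python
import re

text = 'foo/bar spam\neggs'

# The capturing group makes re.split return the separators as list items.
tokens = re.split(r'(\W)', text)

# With exactly one capturing group the list alternates text, separator, text, ...
# so the word tokens sit at the even indices. Upper-casing is just an example edit.
changed = [tok.upper() if i % 2 == 0 else tok for i, tok in enumerate(tokens)]

# Joining the untouched token list reproduces the input exactly.
assert ''.join(tokens) == text

rejoined = ''.join(changed)
print(tokens)
print(repr(rejoined))
```

Because nothing is discarded by the split, `''.join(tokens)` always gives back the original string, which is what makes the tokenize, edit, rejoin workflow safe.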
+12

That's cool. I didn't know re.split did that with capturing groups. – 2010-01-25 23:48:54

+7

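@Laurence: Well, it's documented: http://docs.python.org/library/re.html#re.split: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list." – 2010-01-25 23:54:41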

+17

It's seriously underdocumented. I've been using Python for 14 years and only just found this out. – smci 2013-06-19 16:33:17

1

You can also split a string with an array of strings instead of a regular expression, like this:

def tokenizeString(aString, separators): 
    # separators is a list of strings used to split the string. 
    # Sort a copy by ascending length: the loop below keeps the last match, 
    # so the longest matching separator wins and the caller's list is untouched. 
    separators = sorted(separators, key=len) 
    listToReturn = [] 
    i = 0 
    while i < len(aString): 
        theSeparator = "" 
        for current in separators: 
            if current == aString[i:i+len(current)]: 
                theSeparator = current 
        if theSeparator != "": 
            listToReturn += [theSeparator] 
            i = i + len(theSeparator) 
        else: 
            if listToReturn == []: 
                listToReturn = [""] 
            if listToReturn[-1] in separators: 
                listToReturn += [""] 
            listToReturn[-1] += aString[i] 
            i += 1 
    return listToReturn 


print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"])) 
2
# This keeps all separators in the result 
import re 

st = "%%(c+dd+e+f-1523)%%7" 
sh = re.compile(r'[+\-/*<>%()]') 

def splitStringFull(sh, st): 
    ls = sh.split(st) 
    lo = [] 
    start = 0 
    for l in ls: 
        if not l: 
            continue 
        # search from `start` so a repeated token is not found at an earlier position 
        k = st.find(l, start) 
        if k > start: 
            # the text between the previous token and this one is the separator run 
            lo.append(st[start:k]) 
        lo.append(l) 
        start = k + len(l) 
    if start < len(st): 
        # keep separators that trail the last token 
        lo.append(st[start:]) 
    return lo 

li = splitStringFull(sh, st) 
print(li) 
# ['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7'] 
7

Another solution without regular expressions that works well on Python 3:

# Split strings and keep separator 
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', ''] 

def split_and_keep(s, sep): 
    if not s: return [''] # consistent with string.split() 

    # Find replacement character that is not used in string 
    # i.e. just use the highest available character plus one 
    # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError) 
    p=chr(ord(max(s))+1) 

    return s.replace(sep, sep+p).split(p) 

for s in test_strings: 
    print(split_and_keep(s, '<')) 


# If the unicode limit is reached it will fail explicitly 
unicode_max_char = chr(1114111) 
ridiculous_string = '<Hello>'+unicode_max_char+'<World>' 
print(split_and_keep(ridiculous_string, '<')) 
9

If you are splitting on newlines, use splitlines(True).

>>> 'line 1\nline 2\nline without newline'.splitlines(True) 
['line 1\n', 'line 2\n', 'line without newline'] 

(Not a general solution, but I'm adding it here in case someone arrives at this question without realizing this method exists.)
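
As a small sketch of why keeping the line endings matters, the example below edits each line and joins the pieces back without losing any newline. The line-numbering step is an assumed example of a per-line manipulation, not something from the answer above.

```python
text = 'line 1\nline 2\nline without newline'

# keepends=True leaves each line's terminator attached to the line.
lines = text.splitlines(keepends=True)

# Nothing was dropped, so joining restores the original text.
assert ''.join(lines) == text

# Example manipulation: prefix every line with its number, then rejoin.
numbered = ''.join(f'{n}: {line}' for n, line in enumerate(lines, 1))
print(numbered)
```

Note that `splitlines` also splits on other line boundaries such as `\r\n`, so it is not identical to splitting on `'\n'` alone.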

1

If you want to split a string and keep the separators, using a regex without a capturing group:

import re 

def finditer_with_separators(regex, s): 
    matches = [] 
    prev_end = 0 
    for match in regex.finditer(s): 
        match_start = match.start() 
        # skip the empty piece at the start or between adjacent separators 
        if match_start != prev_end: 
            matches.append(s[prev_end:match_start]) 
        matches.append(match.group()) 
        prev_end = match.end() 
    if prev_end < len(s): 
        matches.append(s[prev_end:]) 
    return matches 

s = "f(g(x))"  # assumed sample input 
regex = re.compile(r"[()]") 
matches = finditer_with_separators(regex, s) 

If you can assume the regex is wrapped in a capturing group:

def split_with_separators(regex, s): 
    matches = list(filter(None, regex.split(s))) 
    return matches 

regex = re.compile(r"([\(\)])") 
matches = split_with_separators(regex, s) 

Both approaches also remove the empty groups, which are useless and annoying in most cases.
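
To make the "empty groups" concrete, here is a short sketch comparing the raw `re.split` output with the filtered one. The input `'f(g(x))'` is an assumed sample, chosen because it contains two adjacent separators and ends with one.

```python
import re

regex = re.compile(r"([()])")
s = 'f(g(x))'  # assumed sample input with adjacent and trailing separators

# re.split emits an empty string wherever two separators touch,
# and another one when the string ends with a separator.
raw = regex.split(s)

# filter(None, ...) drops those empty strings and keeps everything else.
kept = list(filter(None, raw))

print(raw)
print(kept)
```

Since only empty strings are removed, joining either list still reproduces `s`.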

1

If there is only one separator, you can use a list comprehension:

text = 'foo,bar,baz,qux' 
sep = ',' 

Appending or prepending the separator:

result = [x+sep for x in text.split(sep)] 
#['foo,', 'bar,', 'baz,', 'qux,'] 
# to get rid of the trailing separator 
result[-1] = result[-1][:-len(sep)] 
#['foo,', 'bar,', 'baz,', 'qux'] 

result = [sep+x for x in text.split(sep)] 
#[',foo', ',bar', ',baz', ',qux'] 
# to get rid of the leading separator 
result[0] = result[0][len(sep):] 
#['foo', ',bar', ',baz', ',qux'] 

The separator as its own element:

result = [u for x in text.split(sep) for u in (x, sep)] 
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ','] 
result = result[:-1] # to get rid of the trailing separator 