Efficiently split a string using multiple separators and retaining each separator?

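I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.

Furthermore, I need the separators to remain in the output list, in between the items they separated in the string.

For instance,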

"Now is the winter of our discontent"

should output:

['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

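I'm not sure how to do this without resorting to an orgy of nested loops, which is unacceptably slow. How can I do it?

Source

2012-11-01 blz

I'm guessing that, since you accepted DSM's answer, you intended consecutive punctuation to stay grouped together? – John

@johnthexiii, I accepted it because it didn't use 're'. The option to group consecutive separators is an added bonus, but I'm sure it could be done just as easily with a regex. – blz

A different, non-regex approach from the others: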

>>> import string 
>>> from itertools import groupby 
>>> 
>>> special = set(string.punctuation + string.whitespace) 
>>> s = "One two three tab\ttabandspace\t end" 
>>> 
>>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)] 
>>> split_combined 
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end'] 
>>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)] 
>>> split_separated 
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']

I guess you could use dict.fromkeys and .get instead of the lambda.
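One caveat with the second variant: groupby merges adjacent items whose keys compare equal, so two identical separators in a row (such as "..") share a key and still land in one group. The sketch below illustrates this and shows one possible workaround. The helper split_each is hypothetical, not part of the answer; it gives every separator a unique key (its index) while all other characters share the key -1.

```python
import string
from itertools import groupby

special = set(string.punctuation + string.whitespace)

# The key used in the answer: identical neighbouring separators get equal keys.
naive = [''.join(g) for _, g in groupby("a..b", lambda c: c if c in special else False)]


def split_each(s):
    """Hypothetical helper: every separator character becomes its own item."""
    # Separators are keyed by their (unique) index; everything else by -1,
    # so only runs of non-separator characters are merged.
    key = lambda pair: pair[0] if pair[1] in special else -1
    return [''.join(c for _, c in g) for _, g in groupby(enumerate(s), key)]


print(naive)               # ['a', '..', 'b']
print(split_each("a..b"))  # ['a', '.', '.', 'b']
```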

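[Edit]

Some explanation:

groupby accepts two arguments, an iterable and an (optional) key function. It loops through the iterable and groups the items by the value of the key function: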

>>> groupby("sentence", lambda c: c in 'nt') 
<itertools.groupby object at 0x9805af4> 
>>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')] 
[(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]

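Items with contiguous equal values of the key function are grouped together. (This is actually a common source of bugs: people forget that if they want to group items that might not be adjacent, they have to sort by the key function first.)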
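To make the sorting pitfall concrete, here is a small sketch with made-up data (the word list and the first_letter key are illustrative assumptions, not from the answer). Without sorting, the same key shows up in several groups; after sorting by the key, each key appears once.

```python
from itertools import groupby

words = ["apple", "bob", "avocado", "banana"]
first_letter = lambda w: w[0]

# Grouping as-is only merges neighbours that share a key.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]

# Sorting by the same key first brings equal keys together.
sorted_groups = [(k, list(g))
                 for k, g in groupby(sorted(words, key=first_letter), first_letter)]

print(unsorted_groups)
# [('a', ['apple']), ('b', ['bob']), ('a', ['avocado']), ('b', ['banana'])]
print(sorted_groups)
# [('a', ['apple', 'avocado']), ('b', ['bob', 'banana'])]
```

For the string-splitting problem in this question, the "neighbours only" behaviour is exactly what is wanted, so no sorting is needed.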

As @JonClements guessed, what I had in mind was

>>> special = dict.fromkeys(string.punctuation + string.whitespace, True) 
>>> s = "One two three tab\ttabandspace\t end" 
>>> [''.join(g) for k,g in groupby(s, special.get)] 
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']

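for the case where we're combining the separators. If the key isn't in the dictionary, .get returns None.

Source

2012-11-01 22:08:24 DSM

Or another option instead of the lambda (although it's ugly): 'groupby(s, special.__contains__)'... –

@JonClements: Yeah, I think I'd use the dictionary before I'd use the special method. :^) – DSM

'partial(contains, special)' then? ;) –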
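The key-function alternatives mentioned in this answer and its comments can be compared side by side. This is only a sketch: split_with is a hypothetical helper introduced here, and contains is assumed to mean operator.contains, which the comment does not spell out.

```python
import string
from functools import partial
from itertools import groupby
from operator import contains

special = frozenset(string.punctuation + string.whitespace)
s = "One two three tab\ttabandspace\t end"


def split_with(key):
    """Hypothetical helper: group s by the given key function and join each group."""
    return [''.join(g) for _, g in groupby(s, key)]


by_lambda = split_with(lambda c: c in special)
by_dunder = split_with(special.__contains__)
by_partial = split_with(partial(contains, special))   # contains(special, c) is "c in special"
by_dict_get = split_with(dict.fromkeys(special, True).get)  # True for separators, None otherwise

print(by_lambda == by_dunder == by_partial == by_dict_get)  # True
```

All four keys only need to distinguish "separator" from "not a separator", so they produce the same grouping.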

import re 
import string 

p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
    string.punctuation + string.whitespace))) 

print(p.findall("Now is the winter of our discontent"))

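I'm not a big fan of using regular expressions for every problem, but I don't think you have much choice here if you want it fast and short.

Since you're not familiar with regular expressions, I'll explain this one:

[...] means any one of the characters inside the square brackets
[^...] means any one character not inside the square brackets
+ after something means one or more of the preceding thing
x|y means match either x or y

So the regex matches one or more characters that are either all punctuation and whitespace, or all neither. The findall method finds all non-overlapping matches of the pattern.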
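As a quick check of that description, the sketch below wraps the same pattern in a function and runs it on a string with mixed punctuation. The names token_re and tokenize are assumptions made for this example.

```python
import re
import string

# Escape the separators so characters such as ']' and '^' are taken literally.
separators = re.escape(string.punctuation + string.whitespace)

# Either a run of non-separators, or a run of separators.
token_re = re.compile("[^{0}]+|[{0}]+".format(separators))


def tokenize(text):
    """Hypothetical wrapper: return alternating runs of words and separators."""
    return token_re.findall(text)


print(tokenize("a, b...c d"))  # ['a', ', ', 'b', '...', 'c', ' ', 'd']
```

Because every character belongs to exactly one of the two alternatives, joining the result gives back the original string.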

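Source

2012-11-01 21:56:33

You may want to use 're.escape(string.punctuation + string.whitespace)', otherwise I think your character class will end early at the ']'. –

I don't think it works for "..Now is the winter of our discontent" – John

@F.J Fixed. And "Now is the winter of our discontent" works for me. –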

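from functools import reduce  # built in on Python 2, must be imported on Python 3
from string import punctuation

s = "..test. and stuff"

# Pad each punctuation character with spaces so that split() isolates it.
f = lambda s, c: s + ' ' + c + ' ' if c in punctuation else s + c
# Start reduce from '' so that a leading punctuation character is padded too.
# Note: whitespace separators are dropped; only punctuation is kept.
l = sum([reduce(f, word, '').split() for word in s.split()], [])

print(l)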

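Source

2012-11-01 21:57:07 John

Try this:

import re
import string

re.split('([' + re.escape(string.punctuation + string.whitespace) + ']+)', "Now is the winter of our discontent")

Explanation from the Python documentation:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.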
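The effect of the capturing parentheses is easiest to see by splitting the same text with and without them. A minimal sketch, with an example string chosen just for illustration:

```python
import re

text = "a,b;c"

# No capturing group: the separators are discarded.
without_group = re.split(r"[,;]", text)

# Capturing group: each matched separator is kept in the result.
with_group = re.split(r"([,;])", text)

print(without_group)  # ['a', 'b', 'c']
print(with_group)     # ['a', ',', 'b', ';', 'c']
```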

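Source

2012-11-01 21:58:23 Bula

Ugly behavior with consecutive whitespace: 're.split(r'( )', ' ' * 2)' yields '['', ' ', '', ' ', '']'. –

@F.J Consecutive whitespace/separators should be handled better now. – Bula

Depending on the text you are dealing with, you may be able to simplify your notion of a separator to "anything other than letters and digits". If that works for you, you can use the following regex solution: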

re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)

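This assumes that you want to split on each individual separator even when several occur consecutively, so 'foo..bar' becomes ['foo', '.', '.', 'bar']. If you expect ['foo', '..', 'bar'] instead, use [a-zA-Z\d]+|[^a-zA-Z\d]+ (the only difference is the + at the very end).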

Source

2012-11-01 22:02:09

This doesn't work for characters outside the ASCII range. – DzinX
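A short sketch of that limitation, assuming Python 3, where str patterns are Unicode-aware by default. The second pattern, [^\W_]+|[\W_], is a suggestion made here rather than something proposed in the thread; it treats any Unicode letter or digit as part of a word and everything else, including the underscore, as a separator.

```python
import re

text = "caf\u00e9 au lait"  # "café au lait"

# The ASCII-only character class treats the accented letter as a separator.
ascii_only = re.findall(r"[a-zA-Z\d]+|[^a-zA-Z\d]", text)

# \w covers Unicode letters and digits (plus '_', which is excluded again here).
unicode_aware = re.findall(r"[^\W_]+|[\W_]", text)

print(ascii_only)     # ['caf', 'é', ' ', 'au', ' ', 'lait']
print(unicode_aware)  # ['café', ' ', 'au', ' ', 'lait']
```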

A solution in linear (O(n)) time:

Let's say you have a string:

original = "a, b...c d"

First, convert all separators to spaces:

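import string

splitters = string.punctuation + string.whitespace
# Python 3 spelling; on Python 2 this was string.maketrans(...)
trans = str.maketrans(splitters, ' ' * len(splitters))
s = original.translate(trans)

Now s == 'a  b   c d'. You can then use itertools.groupby to alternate between runs of spaces and runs of non-spaces: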

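import itertools

result = []
position = 0
for _, letters in itertools.groupby(s, lambda c: c == ' '):
    # Each group is a run of spaces or a run of non-spaces in s;
    # slice the same span out of the original string.
    letter_count = len(list(letters))
    result.append(original[position:position + letter_count])
    position += letter_count

Now result == ['a', ', ', 'b', '...', 'c', ' ', 'd'], which is what you need.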
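The two steps above can be combined into one reusable function. This is a sketch under the assumption of Python 3 (str.maketrans rather than Python 2's string.maketrans); the names SPLITTERS, TRANS and split_keep are made up for this example.

```python
import string
from itertools import groupby

SPLITTERS = string.punctuation + string.whitespace
# Map every separator character to a plain space.
TRANS = str.maketrans(SPLITTERS, " " * len(SPLITTERS))


def split_keep(original):
    """Hypothetical helper: split on runs of separators and keep those runs."""
    masked = original.translate(TRANS)
    result = []
    position = 0
    for _, run in groupby(masked, lambda c: c == " "):
        # The masked string has the same length as the original, so the
        # run length tells us which slice of the original to take.
        length = sum(1 for _ in run)
        result.append(original[position:position + length])
        position += length
    return result


print(split_keep("a, b...c d"))  # ['a', ', ', 'b', '...', 'c', ' ', 'd']
```

Translation and grouping each make a single pass over the input, which is where the linear running time comes from.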

Source

2012-11-01 22:04:30 DzinX

My take:

from string import whitespace, punctuation 
import re 

pattern = re.escape(whitespace + punctuation) 
print(re.split('([' + pattern + '])', 'now is the winter of'))

Source

2012-11-01 22:07:01

+1 for writing exactly the same thing a minute later ;) – DzinX

Ugly behavior with consecutive separators: 're.split('([' + pattern + '])', '..')' results in '['', '.', '', '.', '']'. –
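If the empty strings between adjacent separators are unwanted, one option is to filter them out after splitting. A minimal sketch; split_each_separator is a hypothetical name and the filtering step is a suggestion, not part of the answer above.

```python
import re
from string import punctuation, whitespace

# One capturing group per single separator character.
pattern = re.compile("([" + re.escape(whitespace + punctuation) + "])")


def split_each_separator(text):
    """Hypothetical helper: split on every separator and drop empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


print(split_each_separator(".."))    # ['.', '.']
print(split_each_separator("a, b"))  # ['a', ',', ' ', 'b']
```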

from itertools import chain, cycle

s = "Now is the winter of our discontent"
words = s.split()

# Interleave every word with a single space (zip replaces Python 2's izip).
wordsWithWhitespace = list(chain.from_iterable(zip(words, cycle([" "]))))
# result : ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent', ' ']

Source

2012-11-01 22:07:39 lucasg

-1: this only works with whitespace as the separator. – DzinX

For an arbitrary collection of separators:

def separate(myStr, seps):
    answer = []
    temp = []
    for char in myStr:
        if char in seps:
            # Flush the characters collected so far, then the separator itself.
            answer.append(''.join(temp))
            answer.append(char)
            temp = []
        else:
            temp.append(char)
    answer.append(''.join(temp))
    return answer

In [4]: print(separate("Now is the winter of our discontent", set(' ')))
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

In [5]: print(separate("Now, really - it is the winter of our discontent", set(' ,-')))
['Now', ',', '', ' ', 'really', ' ', '', '-', '', ' ', 'it', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

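Hope this helps

Source

2012-11-01 22:18:04 inspectorG4dget

This might start getting slow once you use 'string.punctuation + string.whitespace' as the 'seps' argument: for every character, you are searching the list of separators in linear time. – DzinX

Not if you pass them in as a 'set' – inspectorG4dget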
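Following up on that exchange, here is a sketch of a variant that converts the separators to a frozenset itself, so membership tests take constant time on average whatever the caller passes in. It also skips the empty strings that appear between adjacent separators. The name separate_nonempty is hypothetical, and dropping the empty strings is a design choice made here, not something the answer prescribes.

```python
import string


def separate_nonempty(text, seps):
    """Hypothetical variant of separate(): set lookups, no empty strings."""
    seps = frozenset(seps)  # average O(1) membership tests
    answer = []
    current = []
    for char in text:
        if char in seps:
            if current:  # flush the pending word, if there is one
                answer.append("".join(current))
                current = []
            answer.append(char)
        else:
            current.append(char)
    if current:
        answer.append("".join(current))
    return answer


print(separate_nonempty("Now, really - it", string.punctuation + string.whitespace))
# ['Now', ',', ' ', 'really', ' ', '-', ' ', 'it']
```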
