Python: split on punctuation but still include it

Here is the list of strings that I have:

[ 
    ['It', 'was', 'the', 'besst', 'of', 'times,'], 
    ['it', 'was', 'teh', 'worst', 'of', 'times'] 
]

I need times, to be split at the punctuation, so that it becomes 'times', ','
or, as another example, if I have Why?!? I would need it to be 'Why', '?!?'

import string 

def punctuation(string): 

for word in string: 
    if word contains (string.punctuation): 
     word.split()

I know this isn't Python at all! But it is what I want it to do.

Source

2013-07-16 user2553807

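Do you mean you want to tokenize? So if you also had "$ 3.88" or ":" trailing a word in the strings, would you want to split those off too and keep the delimiter? – Tom

I haven't used a tokenize function before. What would that do? – user2553807

There isn't one. But there is a package: http://nltk.org/api/nltk.tokenize.html. – Tom

You can use a regular expression, for example: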

In [1]: import re 

In [2]: re.findall(r'(\w+)(\W+)', 'times,') 
Out[2]: [('times', ',')] 

In [3]: re.findall(r'(\w+)(\W+)', 'why?!?') 
Out[3]: [('why', '?!?')] 

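One caveat with the pattern above: it needs at least one non-word character after the word, so re.findall(r'(\w+)(\W+)', 'times') returns an empty list. A minimal sketch that also keeps words without punctuation and applies the split to the nested list from the question could look like this. The helper name split_word and the alternation pattern are assumptions for illustration, not part of the answer above.

```python
import re

# Hypothetical helper (not from the answer above): split one word into
# runs of word characters and runs of punctuation, keeping both.
def split_word(word):
    return re.findall(r"\w+|[^\w\s]+", word)

words = [
    ['It', 'was', 'the', 'besst', 'of', 'times,'],
    ['it', 'was', 'teh', 'worst', 'of', 'times'],
]

# Split every word, then flatten the pieces back into one list per sentence.
result = [[piece for word in sentence for piece in split_word(word)]
          for sentence in words]
print(result)
```

Note that \w also matches digits and the underscore, so a token such as 3.88 would come back as '3', '.', '88' with this pattern.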

Source

2013-07-16 16:39:06

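You can use finditer, even if the string is more complex.

>>> import re, string
>>> # re.escape keeps characters such as ']' and '\' literal inside the class
>>> r = re.compile(r"(\w+)([" + re.escape(string.punctuation) + "]*)")
>>> s = 'Why?!?Why?*Why'
>>> [x.groups() for x in r.finditer(s)]
[('Why', '?!?'), ('Why', '?*'), ('Why', '')]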
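
The finditer approach returns (word, punctuation) tuples, with an empty string when no punctuation follows. If a flat list like 'times', ',' is what you are after, a rough sketch is to flatten the groups and drop the empty ones. The function name split_keep_punct is made up for this example.

```python
import re
import string

# Same idea as the answer above: a word followed by optional punctuation.
PATTERN = re.compile(r"(\w+)([" + re.escape(string.punctuation) + r"]*)")

def split_keep_punct(text):
    """Return words and punctuation runs as one flat list (hypothetical helper)."""
    pieces = []
    for match in PATTERN.finditer(text):
        # Skip the empty punctuation group produced when nothing follows the word.
        pieces.extend(p for p in match.groups() if p)
    return pieces

print(split_keep_punct('Why?!?Why?*Why'))
```

Because the pattern has to start with a word character, punctuation that comes before the first word is skipped rather than returned.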

Source

2013-07-16 16:45:06 zhangyangyu

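Something like this? (Assuming the punctuation is always at the end.)

def lcheck(word):
    # Split at the first non-alphabetic character and keep both halves.
    for i, letter in enumerate(word):
        if not letter.isalpha():
            # word[:i], not word[0:(i-1)], which would drop the last letter
            return [word[:i], word[i:]]
    return [word]

value = 'times,'
print(lcheck(value))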
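
Since the stated assumption is that the punctuation always sits at the end, another option is to scan from the right instead of the left. This is only a sketch under that assumption, and split_trailing is a hypothetical name; the benefit is that an apostrophe inside the word no longer triggers the split.

```python
def split_trailing(word):
    """Split off a trailing run of non-alphanumeric characters (sketch)."""
    i = len(word)
    # Walk back from the end while the characters are not letters or digits.
    while i > 0 and not word[i - 1].isalnum():
        i -= 1
    head, tail = word[:i], word[i:]
    # Drop whichever half is empty so plain words come back unchanged.
    return [part for part in (head, tail) if part]

print(split_trailing('times,'))
print(split_trailing("don't,"))
```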

Source

2013-07-16 16:48:13 Jiminion

Thanks for the formatting, Jon. – Jiminion

lcheck("don't") may not work as expected. – dansalmo

A generator solution without regular expressions:

import string
from itertools import takewhile, dropwhile

def splitp(s):
    not_punc = lambda c: c in string.ascii_letters + "'"  # won't split "don't"
    for w in s:
        punc = ''.join(dropwhile(not_punc, w))
        if punc:
            yield ''.join(takewhile(not_punc, w))
            yield punc
        else:
            yield w

# s is an iterable of words, e.g. one inner list from the question
list(splitp(s))
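
To see what takewhile and dropwhile are doing in the generator above, here is a small standalone sketch on single words. The predicate name is_word_char and the sample words are my own choices for illustration.

```python
import string
from itertools import takewhile, dropwhile

# Same test as not_punc above: ASCII letters and the apostrophe count as "word".
is_word_char = lambda c: c in string.ascii_letters + "'"

def head_and_tail(word):
    # takewhile keeps the leading run that passes the test,
    # dropwhile keeps everything from the first failure onward.
    head = ''.join(takewhile(is_word_char, word))
    tail = ''.join(dropwhile(is_word_char, word))
    return head, tail

print(head_and_tail("don't?!"))
print(head_and_tail("$3.88"))
```

The second call returns an empty head, because digits and '$' are not in the allowed set. So with input like "$3.88" the generator would yield an empty string first, which you may want to filter out.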

Source

2013-07-16 17:48:36 dansalmo
