2017-05-30 188 views
0

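re.split special case for splitting comma-separated strings

I want to use Python's re.split() to split a sentence into multiple strings at the commas, but I don't want the split applied after a single word that is followed by a comma. For example: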

s = "Yes, alcohol can have a place in a healthy diet." 
desired result = ["Yes, alcohol can have a place in a healthy diet."] 

Another example:

s = "But, of course, excess alcohol is terribly harmful to health in a variety of ways, and even moderatealcohol intake is associated with an increase in the number two cause of premature death: cancer." 
desired output = ["But, of course" , "excess alcohol is terribly harmful to health in a variety of ways" , "and even moderatealcohol intake is associated with an increase in the number two cause of premature death: cancer."] 

Any pointers, please?

+0

What have you tried so far? – depperm

+1

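Maybe you should split on the commas and then rejoin each single word with the phrase that follows it. Also, what should happen if there are several such words in a row, as in "Hey, hey, hey, of course, yes..."? –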
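The split-then-rejoin idea from this comment could be sketched as follows. This is only an illustration, not code from the thread: the function name `split_keep_single_words` is hypothetical, "single word" is assumed to mean a piece matching `\w+` in full, and a run of several single words is assumed to attach to the next longer phrase.

```python
import re

def split_keep_single_words(sentence):
    """Split on commas, then glue single words back onto the next phrase.

    Hypothetical helper: the policy for runs of single words
    ("Hey, hey, hey, of course") is an assumption, not something
    the question specifies.
    """
    pieces = [p.strip() for p in sentence.split(",")]
    result = []
    pending = []  # single words waiting for the next multi-word phrase
    for piece in pieces:
        if not piece:
            continue  # skip empty pieces produced by ",," or a trailing comma
        if re.fullmatch(r"\w+", piece):
            pending.append(piece)
        else:
            result.append(", ".join(pending + [piece]))
            pending = []
    if pending:
        # trailing single words have no following phrase; keep them together
        result.append(", ".join(pending))
    return result
```

Note that rejoining with `", "` normalizes the whitespace after each kept comma, so the output is not guaranteed to be a byte-for-byte slice of the input.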

+0

@depperm, I tried things like sep = re.split('(?<!\d)[,](?!\d)', string) and several others, but none of them seems to be bulletproof –

Answers

1

Since Python's re module does not support variable-length lookbehind assertions in regular expressions, I would use re.findall() instead:

In [3]: re.findall(r"\s*((?:\w+,)?[^,]+)",s) 
Out[3]: 
['But, of course', 
'excess alcohol is terribly harmful to health in a variety of ways', 
'and even moderatealcohol intake is associated with an increase in the number two cause of premature death: cancer.'] 

Explanation:

\s*  # Match optional leading whitespace, don't capture that 
(   # Capture in group 1: 
(?:\w+,)? # optionally: A single "word", followed by a comma 
[^,]+  # and/or one or more characters except commas 
)   # End of group 1 
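
For reuse, the same pattern can be written with `re.VERBOSE` so the explanation lives next to the regex. This is a minimal sketch; the names `CHUNK` and `split_sentence` are made up for illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer, compiled in verbose mode so the
# whitespace and comments inside the pattern are ignored by the engine.
CHUNK = re.compile(
    r"""
    \s*              # optional leading whitespace, not captured
    (                # group 1: the chunk that findall() returns
      (?:\w+,)?      # optionally a single word followed by a comma
      [^,]+          # one or more characters that are not commas
    )
    """,
    re.VERBOSE,
)

def split_sentence(s):
    """Return the comma-separated chunks of s, keeping 'Word, ...' together."""
    return CHUNK.findall(s)

print(split_sentence("Yes, alcohol can have a place in a healthy diet."))
```

Because the pattern contains exactly one capturing group, `findall()` returns a flat list of the group 1 matches.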
+0

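One more request: can the regex be modified to handle the following case as well? ["Head and neck cancer, esophageal cancer, liver cancer, colon cancer, rectal cancer and breast cancer are all linked to drinking alcohol."] –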

+0

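Difficult. Does your input contain only one sentence, or could there be several? If the latter, you should first use an NLP tool to split the input into separate sentences. After that, I think it can be done. –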

+0

Yes, my input consists of single sentences, because I'm already using NLP to split the large string into individual sentences. :) –