2016-01-05 125 views
1

How to split a string on a delimiter, but exclude other strings

I have this string, and I want to split it on periods:

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.' 

This is the result I want:

['you can get it cheaper than $20.99. ', 'shop at amazon.com.', ' hurry before prices go up.'] 

I am splitting on every lowercase letter followed by a period, and on every digit followed by a period and a space:

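import re

x = []
# The capture group keeps each matched sentence ending in the split result.
sentences = re.split(r'([a-z]\.|\d\.\s)', j)
sentence_endings = sentences[1::2]
for position in range(len(sentences)):
    if sentences[position] in sentence_endings:
        # Re-attach each ending to the text that precedes it.
        x.append(sentences[position - 1] + sentences[position])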

Printing x gives me:

['you can get it cheaper than $20.99. ', 'shop at amazon.', 'com.', ' hurry before prices go up.'] 

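I want 'amazon.com' to stay as one string, so I tried to tell the regex to ignore '.com' with re.split(r'([a-z]\.|\d\.\s)[^.com]', j), but that does not give me the result I want. What is the best way to do this?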
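A note on why that attempt falls short: [^.com] is a character class, so it matches one single character that is not '.', 'c', 'o' or 'm', rather than meaning "not followed by .com". The sketch below swaps in a negative lookahead, (?!com), which is an assumption about what was intended and not part of the original code; the names parts and sentences are illustrative only.

```python
import re

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

# [^.com] matches exactly one character that is not '.', 'c', 'o' or 'm'.
matches_x = bool(re.fullmatch(r'[^.com]', 'x'))   # 'x' is allowed
matches_c = bool(re.fullmatch(r'[^.com]', 'c'))   # 'c' is excluded

# Assumed alternative: (?!com) rejects a "letter + period" that is
# directly followed by "com", without consuming any characters.
parts = re.split(r'([a-z]\.(?!com)|\d\.\s)', j)

# Even indexes hold the text, odd indexes hold the captured endings.
sentences = [body + ending for body, ending in zip(parts[0::2], parts[1::2])]
print(sentences)
# ['you can get it cheaper than $20.99. ', 'shop at amazon.com.', ' hurry before prices go up.']
```

Hard-coding "com" only covers this one case; other domain endings or abbreviations would need their own handling.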

Answers

1

A non-regex option would be to use nltk.sent_tokenize() (this relies on NLTK's Punkt tokenizer data being installed):

>>> import nltk 
>>> j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.' 
>>> nltk.sent_tokenize(j) 
['you can get it cheaper than $20.99.', 'shop at amazon.com.', 'hurry before prices go up.'] 
3

A simple regex to split on a period followed by a space would be \.\s

You can use a lookbehind to keep the period when splitting: (?<=\.)\s
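To make the difference between the two patterns concrete, here is a minimal sketch run against the string from the question. The variable names without_periods and with_periods are illustrative only.

```python
import re

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

# Splitting on "period + whitespace" consumes the period along with the space.
without_periods = re.split(r'\.\s', j)

# The lookbehind only asserts that a period precedes the whitespace,
# so the split consumes the whitespace alone and the period is kept.
with_periods = re.split(r'(?<=\.)\s', j)

print(without_periods)
# ['you can get it cheaper than $20.99', 'shop at amazon.com', 'hurry before prices go up.']
print(with_periods)
# ['you can get it cheaper than $20.99.', 'shop at amazon.com.', 'hurry before prices go up.']
```

Neither pattern splits inside '$20.99' or 'amazon.com', because those periods are not followed by whitespace. Unlike the output requested in the question, the separating spaces are dropped.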

If you want to use a split approach to get just 'amazon.com' out of your string, you could try .*(?=amazon.com)|(?<=amazon.com).*
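A minimal sketch of that split, with two assumptions that are not in the answer itself: the dots are escaped as amazon\.com so they match only a literal period, and empty strings are filtered out, since splitting on everything before and after the target leaves empty pieces at the edges.

```python
import re

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

# The first alternative matches everything up to "amazon.com";
# the second matches everything after it. Both act as separators.
pattern = r'.*(?=amazon\.com)|(?<=amazon\.com).*'

# Drop the empty strings that the split leaves around the separators.
pieces = [piece for piece in re.split(pattern, j) if piece]

print(pieces)
# ['amazon.com']
```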

+0

re.split(r'(?<=\.)\s', s) –