2016-08-27 122 views
1

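How to split a string on commas or periods in NLTK

I want to split a string on commas and/or periods in NLTK. I have already tried sent_tokenize(), but it only splits on periods.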

I also tried this code:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."

# Treat ',' as a sentence-ending character in addition to '.', '?' and '!'
class CommaPoint(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', ',')

tokenizer = PunktSentenceTokenizer(lang_vars=CommaPoint())
n_w = tokenizer.tokenize(ex_sent)
print(n_w)

The output of the code above is:

['This is an example showing sentence filtration.This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.\n']

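When a '.' is not followed by a space (as in "filtration.This"), the tokenizer does not split there and treats the text around it as a single word.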
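
This behaviour appears to come from Punkt only treating a sentence-end character as a boundary candidate when it is followed by whitespace or a non-word character. One possible workaround, sketched below, is to normalise the text before tokenizing it. The helper name `add_space_after_period` is hypothetical and not part of NLTK, and the pattern assumes plain ASCII prose.

```python
import re

def add_space_after_period(text):
    """Insert a space after any '.' that is immediately followed by a letter."""
    # Hypothetical helper, not part of NLTK. The lookahead leaves decimals
    # such as "3.14" untouched, but it would also alter abbreviations like
    # "e.g." and dotted names like "nltk.tokenize".
    return re.sub(r'\.(?=[A-Za-z])', '. ', text)

ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python."
print(add_space_after_period(ex_sent))
```

The normalised string can then be passed to the tokenizer from the question in place of the raw `ex_sent`.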

The output I want:

['This is an example showing sentence filtration.', 'This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.']


+0

Can you be more specific about your problem? Give some examples of the input and the expected output, and describe what you have tried. See http://stackoverflow.com/help/how-to-ask – alvas

+0

Hi, this is my first time on Stack Overflow. I have tried to explain my problem and I hope you can answer it. Thanks –

+0

Hi alvas, I hope you can help me this time.. –

Answers

3

How about something simple with re:

>>> import re 
>>> sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python." 
>>> re.split(r'[.,]', sent) 
['This is an example showing sentence filtration', 'This is how it is done', ' in case of Python I want to learn more', ' So', ' that i can have some experience over it', ' by it I mean python', ''] 

To keep the delimiters, you can use a capturing group:

>>> re.split(r'([.,])', sent) 
['This is an example showing sentence filtration', '.', 'This is how it is done', ',', ' in case of Python I want to learn more', '.', ' So', ',', ' that i can have some experience over it', ',', ' by it I mean python', '.', ''] 
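
If each delimiter should stay attached to the chunk before it, as in the desired output in the question, the alternating list above can be stitched back together. The following is only a sketch built on the same `re.split` call; the function name `split_keep_delimiters` is made up for illustration.

```python
import re

def split_keep_delimiters(text):
    """Split on '.' or ',' and keep each delimiter at the end of its chunk."""
    # With a capturing group, re.split alternates text and delimiters:
    # [text0, delim0, text1, delim1, ..., last_text]
    parts = re.split(r'([.,])', text)
    texts = parts[0::2]
    delims = parts[1::2] + ['']  # pad so the final text has a partner
    chunks = [(t + d).strip() for t, d in zip(texts, delims)]
    return [c for c in chunks if c]  # drop empty leftovers

sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
for chunk in split_keep_delimiters(sent):
    print(chunk)
```

Because the split does not depend on whitespace, "filtration.This" is separated here even though there is no space after the period.
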
+0

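Thank you very much, alvas. In this case it does split the sentences correctly, but it also strips out the commas and periods, which I want to keep. Is there another way to do this in NLTK? –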

+0

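Can't you just write the commas and full stops back? ;P I suggest using native Python libraries as much as possible if they give you the output you want. If you really must use NLTK, the general-purpose NLP models can usually handle normally formatted news text. – alvas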
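
In the spirit of staying with native Python, a one-step alternative is to match the chunks directly instead of splitting and re-joining. This is a minimal sketch; the pattern and the name `chunks_with_punctuation` are assumptions for illustration, not something given in the thread.

```python
import re

# Assumed pattern: a run of characters that are not '.' or ',',
# followed by an optional single '.' or ','.
CHUNK = re.compile(r'[^.,]+[.,]?')

def chunks_with_punctuation(text):
    """Return chunks ending in '.' or ',' with surrounding whitespace trimmed."""
    pieces = (m.strip() for m in CHUNK.findall(text))
    return [p for p in pieces if p]  # skip whitespace-only matches

print(chunks_with_punctuation("So, that is it. Done"))
```

Trailing text without a final delimiter is kept as its own chunk, which may or may not be what is wanted for a given input.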
