Tokenize based on whitespace and trailing punctuation?

I want to come up with a regular expression that splits a string into a list based on whitespace or trailing punctuation.

For example:

s = 'hel-lo this has whi(.)te, space. very \n good'

What I want is:

['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

s.split() gets me most of the way there, except that it does not take care of the trailing punctuation.
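To make the gap concrete, here is a quick check of what plain s.split() returns for the example string. The variable name `tokens` is only for illustration.

```python
s = 'hel-lo this has whi(.)te, space. very \n good'

# str.split() with no arguments splits on any run of whitespace,
# so the newline and the repeated spaces are handled,
# but the trailing ',' and '.' stay attached to their words.
tokens = s.split()
print(tokens)
# ['hel-lo', 'this', 'has', 'whi(.)te,', 'space.', 'very', 'good']
```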


2017-04-27 Kewl

Are you also allowed to use other libraries, or do you want to use regular expressions only? – titipata

Yes, using any library is fine. – Kewl

import re 
s = 'hel-lo this has whi(.)te, space. very \n good' 
[x for x in re.split(r"([.,!?]+)?\s+", s) if x] 
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

You may need to adjust what counts as "punctuation".
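One way to make the punctuation set adjustable is to build the character class from a string. This is only a sketch: the helper name `tokenize` and its `punct` parameter are made up for illustration, and the `(?:\s+|$)` alternative is an addition of mine so that punctuation at the very end of the string (with no whitespace after it) is split off as well.

```python
import re

def tokenize(text, punct=".,!?"):
    # Hypothetical helper, not part of the answer above.
    # re.escape makes characters such as ']' or '-' safe inside the class.
    punct_class = "[" + re.escape(punct) + "]"
    # Optional run of punctuation (captured, so re.split keeps it),
    # followed by whitespace or the end of the string.
    pattern = "(" + punct_class + "+)?(?:\\s+|$)"
    # re.split yields None for an unmatched group and '' at the edges;
    # both are falsy, so the filter drops them.
    return [tok for tok in re.split(pattern, text) if tok]

s = 'hel-lo this has whi(.)te, space. very \n good'
print(tokenize(s))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
```

Passing a different string, for example tokenize('a; b', punct=';'), changes which characters are treated as trailing punctuation without touching the rest of the pattern.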


2017-04-27 02:26:17 Amadan

A rough solution using spaCy. It already does a pretty good job of tokenizing words.

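import spacy
s = 'hel-lo this has whi(.)te, space. very \n good'
# spaCy 3.x loads models by package name (the model must be installed);
# older releases used the shortcut spacy.load('en')
nlp = spacy.load('en_core_web_sm')
# keep every token that is not pure whitespace
ls = [t.text for t in nlp(s) if t.text.strip()]

>> ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']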

However, it also splits words that are joined by -, so I borrowed a solution from here to merge the words around each - back together.

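# slice bounds (left word, '-', right word) for every hyphen token
merge = [(i-1, i+2) for i, tok in enumerate(ls) if i >= 1 and tok == '-']
# merge from right to left so the earlier indices stay valid
for t in merge[::-1]:
    merged = ''.join(ls[t[0]:t[1]])
    ls[t[0]:t[1]] = [merged]

>> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']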
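The merge step can be tried without spaCy installed by feeding it a plain token list. A minimal sketch, where `merge_hyphenated` is a hypothetical helper name wrapping the same idea; the second call shows why the spans are processed from right to left.

```python
def merge_hyphenated(tokens):
    # Hypothetical helper wrapping the merge loop; works on a copy.
    tokens = list(tokens)
    # (start, end) slice bounds covering left word, '-', right word
    spans = [(i - 1, i + 2) for i, tok in enumerate(tokens)
             if i >= 1 and tok == '-']
    # Right to left: a merge shortens the list, which would shift
    # the indices of any span that comes after it.
    for start, end in reversed(spans):
        tokens[start:end] = [''.join(tokens[start:end])]
    return tokens

ls = ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(merge_hyphenated(ls))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(merge_hyphenated(['a', '-', 'b', '-', 'c']))
# ['a-b-c']
```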


2017-04-27 02:12:13 titipata

I'm using Python 3.6.1.

import re

s = 'hel-lo this has whi(.)te, space. very \n good'
tokens = []  # collects the resulting tokens
for word in s.split():  # split on whitespace
    # split off one trailing punctuation mark; adjust the character class as needed
    parts = re.split(r'([,.])$', word)
    if len(parts) > 1:
        # parts is [word, punctuation, '']; drop the empty string left by the split
        tokens.extend(parts[:-1])
    else:
        tokens.append(word)
# tokens -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']


2017-04-27 04:03:38
