If the punctuation is always at the end of the word, you can use str.rstrip to strip it:
from string import punctuation

for sub in l:
    # Slice assignment updates each sublist in place; words that end up empty are dropped.
    sub[:] = (word for word in (w.rstrip(punctuation) for w in sub)
              if word)
Output:
from pprint import pprint as pp
pp(l)
[['comfortable',
'questions',
'menu',
'items',
'time',
'lived',
'there',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies'],
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
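To make the "always at the end" condition concrete, here is a minimal sketch on made-up words (the sample list is an assumption, not the asker's data). It shows that str.rstrip only removes punctuation from the right-hand end, and that a word made up entirely of punctuation becomes empty and is filtered out:

```python
from string import punctuation

# Hypothetical sample data, not taken from the question.
words = ["clean!!", "gluten-free", "...", "(menu)", "time"]

# rstrip removes punctuation only from the right-hand end of each word.
stripped = [w.rstrip(punctuation) for w in words]

# Drop words that became empty after stripping.
kept = [w for w in stripped if w]

print(stripped)
print(kept)
```

Leading and embedded punctuation, as in "(menu)" and "gluten-free", is left alone, so this approach only fits data where the punctuation really is trailing.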
Or use str.translate, which removes punctuation from any position in the word:
from string import punctuation

for sub in l:
    # Python 2 str.translate: a table of None plus the characters to delete.
    sub[:] = (word for word in (w.translate(None, punctuation) for w in sub)
              if word)
Output:
[['comfortable',
'questions',
'menu',
'items',
'time',
'lived',
'there',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies'],
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
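The two-argument call w.translate(None, punctuation) is the Python 2 form of str.translate. On Python 3 the deletion set goes into a table built with str.maketrans instead. A minimal sketch of the same in-place cleanup, assuming Python 3 and using made-up sample data:

```python
from string import punctuation

# Hypothetical sample data, not taken from the question.
l = [["comfort.able", "menu,", "!!!"], ["wheat-free", "food.this", "clean"]]

# Build the table once: the third argument lists the characters to delete.
table = str.maketrans("", "", punctuation)

for sub in l:
    # Update each sublist in place and drop words that end up empty.
    sub[:] = [word for word in (w.translate(table) for w in sub) if word]

print(l)
```

Building the table once outside the loop avoids recreating it for every word.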
If you want a new list instead (note that this comprehension flattens the sublists into a single list):
cleaned = [word for sub in l
           for word in (w.translate(None, punctuation)
                        for w in sub) if word]
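Because the outer `for sub in l` sits in the same comprehension as the word loop, `cleaned` comes out as one flat list. If the nesting should be preserved, the inner comprehension can be wrapped per sublist. A minimal sketch comparing both shapes, assuming Python 3 (hence str.maketrans) and made-up sample data:

```python
from string import punctuation

# Hypothetical sample data, not taken from the question.
l = [["menu,", "items!", "..."], ["wheat-free", "clean"]]
table = str.maketrans("", "", punctuation)

# One flat list of cleaned words, as in the comprehension above.
flat = [word for sub in l
        for word in (w.translate(table) for w in sub) if word]

# A new list of lists that keeps the original grouping.
nested = [[word for word in (w.translate(table) for w in sub) if word]
          for sub in l]

print(flat)
print(nested)
```

Neither version modifies `l`, unlike the slice-assignment loops earlier.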
translate is much more efficient than a regular expression, and if the punctuation is always at the end, rstrip is more efficient again:
In [2]: %%timeit
....: r = re.compile(r'[^A-Za-z0-9]+')
....: [[y for y in (r.sub('', x) for x in sublst) if y] for sublst in l]
....:
10000 loops, best of 3: 37.3 µs per loop
In [3]: %%timeit
....: out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in
....: l]
....:
10000 loops, best of 3: 58.3 µs per loop
In [4]: from string import punctuation
In [5]: %%timeit
...: cleaned = [word for sub in l
...: for word in (w.translate(None, punctuation)
...: for w in sub) if word]
...:
100000 loops, best of 3: 11.6 µs per loop
In [6]: %%timeit
...: cleaned = [word for sub in l
...: for word in (w.rstrip(punctuation)
...: for w in sub) if word]
...:
100000 loops, best of 3: 6.81 µs per loop
In [7]: %%timeit
....: result = []
....: for d in l:
....:     for r in string.punctuation:
....:         d = [x.replace(r, '') for x in d]
....:     result.append([x for x in d if d])
....:
10000 loops, best of 3: 160 µs per loop
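The timings above come from an IPython session on Python 2. To repeat the comparison elsewhere, something like the following sketch with the standard timeit module could be used. It assumes Python 3, made-up sample data with trailing punctuation only, and hypothetical function names; the numbers it prints depend on the machine and interpreter.

```python
import re
import timeit
from string import punctuation

# Hypothetical sample data with trailing punctuation only.
l = [["comfortable,", "questions.", "menu!", "..."], ["sure", "wheat?", "clean;"]]

NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")
TABLE = str.maketrans("", "", punctuation)


def with_regex(data):
    # Remove every run of non-alphanumeric characters.
    return [[y for y in (NON_ALNUM.sub("", x) for x in sub) if y] for sub in data]


def with_translate(data):
    # Delete punctuation characters anywhere in the word.
    return [[y for y in (x.translate(TABLE) for x in sub) if y] for sub in data]


def with_rstrip(data):
    # Strip punctuation from the right-hand end only.
    return [[y for y in (x.rstrip(punctuation) for x in sub) if y] for sub in data]


# The three approaches agree here because the punctuation is only trailing.
results = [with_regex(l), with_translate(l), with_rstrip(l)]

for func in (with_regex, with_translate, with_rstrip):
    seconds = timeit.timeit(lambda: func(l), number=10000)
    print(f"{func.__name__}: {seconds:.4f} s for 10000 calls")
```

On data with embedded punctuation or non-ASCII letters the three functions no longer return the same result, so the timing comparison is only meaningful when they are interchangeable for the input at hand.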
import re tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in text] – kevin
The above code doesn't work. – kevin
You should update the question with your attempt, what led you there, etc. –