Remove non-alphabetic characters from a tokenized Python list

My data is in a list, and I have tokenized it. The data contains non-alphabetic characters (e.g. ?, ., !).

I want to remove the non-alphabetic characters (e.g. ?, ., !) from the list below.

[['comfortable', 
    'questions?', 
    'menu', 
    'items!', 
    'time', 
    'lived', 
    'there,', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies.'], 
['.', 
    'sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean.']]

The output should look like this:

[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean']]

I tried the following code, which does not work:

import re 
tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts]

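Any suggestions?

Source

2015-09-14 kevin

import re; tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts] – kevin

The code above does not work. – kevin

You should update the question with your attempt, what it resulted in, etc. –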

Your regex approach does not work because what you have is a list of lists, so you end up passing each inner list, rather than a string, to re.sub.
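To make the failure concrete, here is a minimal sketch. The two-sublist `texts` is made-up sample data standing in for the list in the question, and `error` is only a helper name for this illustration.

```python
import re

# Hypothetical miniature of the question's data: a list of lists of tokens.
texts = [['questions?', 'menu'], ['.', 'clean.']]

error = None
try:
    # Each x here is an inner list, not a string, so re.sub rejects it.
    tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts]
except TypeError as exc:
    error = exc
    print('re.sub failed:', exc)
```

The comprehension iterates over the outer list only, which is why re.sub never sees a string.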

You should iterate over the inner lists as well, and then apply re.sub to each string. Example -

>>> import pprint
>>> import re
>>> tokens = [[y for y in (re.sub(r'[^A-Za-z0-9]+', '', x) for x in sublst) if y] for sublst in texts]
>>> pprint.pprint(tokens)
[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]

Source

2015-09-14 18:46:00

new_lst = []
for inner in lst:
    new_inner = []
    for word in inner:
        # Keep only the alphabetic characters of each token.
        filtered = ''.join(c for c in word if c.isalpha())
        # Skip tokens that are empty after filtering, such as '.'.
        if filtered:
            new_inner.append(filtered)
    new_lst.append(new_inner)
print(new_lst)
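Since the loop above comes with little commentary, the following sketch restates the same idea as a function. The name `keep_letters` and the sample data are illustrative and not part of the answer. One caveat worth knowing: str.isalpha also drops digits, unlike the `[^A-Za-z0-9]` pattern used in the question.

```python
def keep_letters(nested):
    """Return a new list of lists with non-letter characters removed.

    Tokens that end up empty (such as '.') are dropped.
    """
    cleaned = []
    for inner in nested:
        # Strip every character that is not a letter from each token.
        stripped = (''.join(c for c in word if c.isalpha()) for word in inner)
        # Keep only tokens that still contain something.
        cleaned.append([word for word in stripped if word])
    return cleaned


# Made-up sample data in the shape of the question's list.
sample = [['questions?', 'items!', 'there,'], ['.', 'clean.', 'route66']]
print(keep_letters(sample))
# [['questions', 'items', 'there'], ['clean', 'route']]
```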

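Source

2015-09-14 18:48:57 taesu

Please add some explanation to your code. – kenorb

You were almost there: your tokens are a list of lists, but your list comprehension only looks at the elements of the outer list.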

from pprint import pprint 

import re 

tokens = [['comfortable', 
      'questions?', 
      'menu', 
      'items!', 
      'time', 
      'lived', 
      'there,', 
      'could', 
      'easily', 
      'direct', 
      'people', 
      'appropriate', 
      'menu', 
      'choices', 
      'given', 
      'allergies.'], 
      ['.', 
      'sure', 
      'giving', 
      'wheat', 
      'fiction', 
      'free', 
      'foodthis', 
      'place', 
      'clean.']] 

out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in 
     tokens] 

pprint(out)

This produces:

[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean']]

Source

2015-09-14 18:49:33 reupen

import string 

data = [['comfortable', 
    'questions?', 
    'menu', 
    'items!', 
    'time', 
    'lived', 
    'there,', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies.'], 
['.', 
    'sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean.']] 

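result = []
for d in data:
    # Remove each punctuation character from every token in this sublist.
    for r in string.punctuation:
        d = [x.replace(r, '') for x in d]
    # Drop tokens that became empty, such as '.'.
    result.append([x for x in d if x])
print(result)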

Source

2015-09-14 18:52:30

If the punctuation is always at the end of a token, you can remove it with str.rstrip:

from string import punctuation 

for sub in l: 
    sub[:] = (word for word in (w.rstrip(punctuation) for w in sub) 
      if word)

Output:

from pprint import pprint as pp 
pp(l) 


[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
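The "always at the end" condition matters. The sketch below, using made-up tokens, shows what rstrip leaves behind when punctuation appears elsewhere, and how strip differs. The names `words`, `trailing` and `both_ends` are illustrative.

```python
from string import punctuation

# Made-up tokens with punctuation at the end, inside, and at the start.
words = ['clean.', 'wait...', "it's", '?what', 'a.b']

# rstrip removes only the trailing run of punctuation characters.
trailing = [w.rstrip(punctuation) for w in words]
print(trailing)
# ['clean', 'wait', "it's", '?what', 'a.b']

# strip trims both ends, but still leaves punctuation inside a token.
both_ends = [w.strip(punctuation) for w in words]
print(both_ends)
# ['clean', 'wait', "it's", 'what', 'a.b']
```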

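Or use str.translate, which removes punctuation from any position (this is the Python 2 form of the call; in Python 3, str.translate takes a single table built with str.maketrans):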

from string import punctuation 

for sub in l: 
    sub[:] = (word for word in (w.translate(None, punctuation) for w in sub) 
      if word)

Output:

[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
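The two-argument `translate(None, punctuation)` call works only on Python 2 byte strings. A minimal sketch of the Python 3 equivalent follows, assuming ASCII punctuation is all that needs removing. The names `table` and `data`, and the sample tokens, are illustrative.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
# It is built once and reused for every token.
table = str.maketrans('', '', punctuation)

# Made-up sample data in the shape of the question's list.
data = [['questions?', "it's", 'there,'], ['.', 'clean.']]

for sub in data:
    # Slice assignment updates each inner list in place.
    sub[:] = [word for word in (w.translate(table) for w in sub) if word]

print(data)
# [['questions', 'its', 'there'], ['clean']]
```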

If you want a new list instead (note that this comprehension flattens the sublists into a single list):

cleaned = [word for sub in l 
      for word in (w.translate(None, punctuation) 
         for w in sub) if word]

translate is much more efficient than a regex, and if the punctuation is always at the end, rstrip is more efficient still:

In [2]: %%timeit 
    ....: r = re.compile(r'[^A-Za-z0-9]+') 
    ....: [[y for y in (r.sub('', x) for x in sublst) if y] for sublst in l] 
    ....: 
10000 loops, best of 3: 37.3 µs per loop 

In [3]: %%timeit 
    ....: out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in 
    ....:  l] 
    ....: 
10000 loops, best of 3: 58.3 µs per loop 

In [4]: from string import punctuation 

In [5]: %%timeit 
    ...: cleaned = [word for sub in l 
    ...:   for word in (w.translate(None, punctuation) 
    ...:       for w in sub) if word] 
    ...: 

100000 loops, best of 3: 11.6 µs per loop 

In [6]: %%timeit 
    ...: cleaned = [word for sub in l 
    ...:   for word in (w.rstrip(punctuation) 
    ...:       for w in sub) if word] 
    ...: 

100000 loops, best of 3: 6.81 µs per loop 

In [7]: %%timeit 
    ...: result = [] 
    ...: for d in l: 
    ...:     for r in string.punctuation: 
    ...:         d = [x.replace(r, '') for x in d] 
    ...:     result.append([x for x in d if d]) 
    ...: 
10000 loops, best of 3: 160 µs per loop
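The figures above come from a Python 2 IPython session. To rerun a similar comparison on Python 3, something like the following sketch with the standard timeit module may help. The sample data and the function names are made up for illustration, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit
from string import punctuation

# Made-up sample data in the shape of the question's list.
data = [['comfortable', 'questions?', 'items!', 'there,'],
        ['.', 'sure', 'clean.']]

pattern = re.compile(r'[^A-Za-z0-9]+')
table = str.maketrans('', '', punctuation)


def with_regex():
    return [[y for y in (pattern.sub('', x) for x in sub) if y] for sub in data]


def with_translate():
    return [[y for y in (x.translate(table) for x in sub) if y] for sub in data]


def with_rstrip():
    return [[y for y in (x.rstrip(punctuation) for x in sub) if y] for sub in data]


for func in (with_regex, with_translate, with_rstrip):
    # Total seconds for 10,000 calls of each variant.
    seconds = timeit.timeit(func, number=10000)
    print('{:<15} {:.4f} s'.format(func.__name__, seconds))
```

All three variants return the same nested result for this sample, so the comparison is like for like here. With punctuation inside tokens, the rstrip variant would give different output.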

Source

2015-09-14 18:54:26
