2015-09-14 54 views
2

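Remove non-alphabetic characters from tokens in a Python list

My data is in a list, and I have tokenized it. The tokens contain non-alphabetic characters (e.g. ?, ., !).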

I want to remove the non-alphabetic characters (e.g. ?, ., !) from the list below.

[['comfortable', 
    'questions?', 
    'menu', 
    'items!', 
    'time', 
    'lived', 
    'there,', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies.'], 
['.', 
    'sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean.']] 

The output should look like this:

[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean']] 

I tried the following code, which does not work:

import re 
tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts] 

Any suggestions?

+0

import re tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts] – kevin

+0

The above code does not work. – kevin

+0

You should update the question with your attempt, what it produced, and so on. –

Answers

2

Your regex approach does not work because what you have is a list of lists, so you are passing each inner list, rather than each string, to re.sub.
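To make the failure concrete, here is a minimal sketch that reproduces it. The shortened `texts` list is an assumption made for illustration; the pattern is the one from the question.

```python
import re

# Shortened stand-in for the nested token list in the question.
texts = [['questions?', 'menu'], ['.', 'clean.']]

try:
    # Each x here is an inner list, not a string, so re.sub rejects it.
    [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts]
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
else:
    error = None
```

The comprehension iterates over the outer list only, so re.sub never sees an individual token.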

You should iterate over the inner lists as well, and then apply re.sub to each string. Example -

>>> tokens = [[y for y in (re.sub(r'[^A-Za-z0-9]+', '', x) for x in sublst) if y] for sublst in texts] 
>>> pprint.pprint(tokens) 
[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']] 
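If the same cleaning is needed in several places, the nested comprehension can be wrapped in a small function. This is only a sketch: the names `NON_ALNUM`, `clean_tokens`, and `sample` are hypothetical, and compiling the pattern once is a convenience rather than something the answer requires.

```python
import re

# Compile the pattern from the question once so it can be reused.
NON_ALNUM = re.compile(r'[^A-Za-z0-9]+')

def clean_tokens(nested):
    """Strip non-alphanumeric characters from every token and drop empties."""
    cleaned = []
    for sublist in nested:
        stripped = (NON_ALNUM.sub('', token) for token in sublist)
        # Tokens such as '.' become '' and are filtered out here.
        cleaned.append([token for token in stripped if token])
    return cleaned

sample = [['questions?', 'items!', 'there,'], ['.', 'sure', 'clean.']]
print(clean_tokens(sample))
```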
0
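# lst is the nested list of tokens from the question 
new_lst = [] 
for inner in lst: 
    new_inner = [] 
    for word in inner: 
        # keep only the alphabetic characters of each word 
        filtered = ''.join(c for c in word if c.isalpha()) 
        if filtered:  # skip tokens that become empty, e.g. '.' 
            new_inner.append(filtered) 
    new_lst.append(new_inner) 
print(new_lst) 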
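Filtering with str.isalpha is not quite the same as the regex `[^A-Za-z0-9]+` used elsewhere in this thread. The sketch below, which assumes Python 3 and uses made-up sample words, shows two differences: isalpha drops digits, and it keeps non-ASCII letters that the ASCII-only character class removes.

```python
import re

word = 'route66!'
letters_only = ''.join(ch for ch in word if ch.isalpha())
alnum_only = re.sub(r'[^A-Za-z0-9]+', '', word)
print(letters_only)  # route
print(alnum_only)    # route66

accented = 'café!'
accented_isalpha = ''.join(ch for ch in accented if ch.isalpha())
accented_regex = re.sub(r'[^A-Za-z0-9]+', '', accented)
print(accented_isalpha)  # café
print(accented_regex)    # caf
```

Which behaviour is right depends on whether digits and accented letters count as part of a token in your data.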
+1

Please add some explanation to your code. – kenorb

1

You were almost there: your tokens are a list of lists, but your list comprehension only looks at the elements of the outer list.

from pprint import pprint 

import re 

tokens = [['comfortable', 
      'questions?', 
      'menu', 
      'items!', 
      'time', 
      'lived', 
      'there,', 
      'could', 
      'easily', 
      'direct', 
      'people', 
      'appropriate', 
      'menu', 
      'choices', 
      'given', 
      'allergies.'], 
      ['.', 
      'sure', 
      'giving', 
      'wheat', 
      'fiction', 
      'free', 
      'foodthis', 
      'place', 
      'clean.']] 

out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in 
     tokens] 

pprint(out) 

This produces:

[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean']] 
-1
import string 

data = [['comfortable', 
    'questions?', 
    'menu', 
    'items!', 
    'time', 
    'lived', 
    'there,', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies.'], 
['.', 
    'sure', 
    'giving', 
    'wheat', 
    'fiction', 
    'free', 
    'foodthis', 
    'place', 
    'clean.']] 

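result = [] 
for d in data: 
    # remove every punctuation character from every token in this sublist 
    for r in string.punctuation: 
        d = [x.replace(r, '') for x in d] 
    # drop tokens that became empty, e.g. '.' 
    result.append([x for x in d if x]) 
print(result) 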
1

If the punctuation is always at the end of a token, you can remove it with str.rstrip:

from string import punctuation 

for sub in l: 
    sub[:] = (word for word in (w.rstrip(punctuation) for w in sub) 
      if word) 

Output:

from pprint import pprint as pp 
pp(l) 


[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']] 
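Note that rstrip only trims the right-hand end of each token. The sketch below uses made-up tokens to show where it differs from str.strip, and that neither touches punctuation inside a word.

```python
from string import punctuation

tokens = ['clean.', '"quoted"', "don't", 'wait...']

right_only = [t.rstrip(punctuation) for t in tokens]
both_ends = [t.strip(punctuation) for t in tokens]

print(right_only)  # ['clean', '"quoted', "don't", 'wait']
print(both_ends)   # ['clean', 'quoted', "don't", 'wait']
```

If punctuation can appear at the start of or inside a token, rstrip alone is not enough.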

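Or use str.translate, which removes punctuation from any position (the two-argument call shown here is the Python 2 signature):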

from string import punctuation 

for sub in l: 
    sub[:] = (word for word in (w.translate(None, punctuation) for w in sub) 
      if word) 

Output:

[['comfortable', 
    'questions', 
    'menu', 
    'items', 
    'time', 
    'lived', 
    'there', 
    'could', 
    'easily', 
    'direct', 
    'people', 
    'appropriate', 
    'menu', 
    'choices', 
    'given', 
    'allergies'], 
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']] 
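The `w.translate(None, punctuation)` call works on Python 2 byte strings only. On Python 3, a rough equivalent builds a deletion table with `str.maketrans`. The sketch below assumes Python 3; the name `REMOVE_PUNCT` and the shortened list are hypothetical.

```python
from string import punctuation

# The third argument to str.maketrans lists the characters to delete.
REMOVE_PUNCT = str.maketrans('', '', punctuation)

l = [['questions?', "don't", 'there,'], ['.', 'clean.']]

for sub in l:
    # Slice assignment keeps the same inner list objects, as in the answer.
    sub[:] = [word for word in (w.translate(REMOVE_PUNCT) for w in sub) if word]

print(l)  # [['questions', 'dont', 'there'], ['clean']]
```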

If you want a new list (note that this comprehension flattens the sublists into a single list):

cleaned = [word for sub in l 
      for word in (w.translate(None, punctuation) 
         for w in sub) if word] 
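The comprehension above yields one flat list of words. If the nested shape should be preserved, the inner part can be wrapped in its own list. A sketch of both forms, assuming Python 3 (hence `str.maketrans`) and a shortened, hypothetical input:

```python
from string import punctuation

table = str.maketrans('', '', punctuation)
l = [['questions?', 'menu'], ['.', 'clean.']]

# One flat list of cleaned words.
flat = [word for sub in l
        for word in (w.translate(table) for w in sub) if word]

# Same cleaning, but one cleaned list per original sublist.
nested = [[word for word in (w.translate(table) for w in sub) if word]
          for sub in l]

print(flat)    # ['questions', 'menu', 'clean']
print(nested)  # [['questions', 'menu'], ['clean']]
```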

translate is much more efficient than a regex, and if the punctuation is always at the end, rstrip is more efficient still:

In [2]: %%timeit 
    ....: r = re.compile(r'[^A-Za-z0-9]+') 
    ....: [[y for y in (r.sub('', x) for x in sublst) if y] for sublst in l] 
    ....: 
10000 loops, best of 3: 37.3 µs per loop 

In [3]: %%timeit 
    ....: out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in 
    ....:  l] 
    ....: 
10000 loops, best of 3: 58.3 µs per loop 

In [4]: from string import punctuation 

In [5]: %%timeit 
    ...: cleaned = [word for sub in l 
    ...:   for word in (w.translate(None, punctuation) 
    ...:       for w in sub) if word] 
    ...: 

100000 loops, best of 3: 11.6 µs per loop 

In [6]: %%timeit 
    ...: cleaned = [word for sub in l 
    ...:   for word in (w.rstrip(punctuation) 
    ...:       for w in sub) if word] 
    ...: 

100000 loops, best of 3: 6.81 µs per loop 

In [7]: %%timeit 
    ...: result = [] 
    ...: for d in l: 
    ...:     for r in string.punctuation: 
    ...:         d = [x.replace(r, '') for x in d] 
    ...:     result.append([x for x in d if d]) 
    ...: 
10000 loops, best of 3: 160 µs per loop
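The timings above come from IPython's `%%timeit`. Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. This assumes Python 3, a shortened input list, and hypothetical function names; the absolute numbers will differ from those quoted above.

```python
import re
import timeit
from string import punctuation

l = [['comfortable', 'questions?', 'items!', 'there,'], ['.', 'sure', 'clean.']]
pattern = re.compile(r'[^A-Za-z0-9]+')
table = str.maketrans('', '', punctuation)

def with_regex():
    return [[y for y in (pattern.sub('', x) for x in sub) if y] for sub in l]

def with_translate():
    return [[y for y in (x.translate(table) for x in sub) if y] for sub in l]

def with_rstrip():
    return [[y for y in (x.rstrip(punctuation) for x in sub) if y] for sub in l]

for func in (with_regex, with_translate, with_rstrip):
    # Total time for 10000 calls of each approach.
    seconds = timeit.timeit(func, number=10000)
    print('{:<15} {:.4f} s per 10000 calls'.format(func.__name__, seconds))
```

All three functions return the same result for this input because the punctuation only ever appears at the end of a token.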