2015-12-04 17 views

Answers

2

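It builds a dictionary that maps each punctuation character to a space, translates the string (effectively removing the punctuation), and then splits on whitespace to produce a list of words.

Step by step: first build a character translation dictionary in which the keys are the ordinals of the punctuation characters and the replacement value is a space. A dictionary comprehension builds it: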

from string import punctuation 
s = 'life is short, stunt it!!?' 
D = {ord(ch):" " for ch in punctuation} 
print(D) 

Result:

{64: ' ', 124: ' ', 125: ' ', 91: ' ', 92: ' ', 93: ' ', 94: ' ', 95: ' ', 96: ' ', 33: ' ', 34: ' ', 35: ' ', 36: ' ', 37: ' ', 38: ' ', 39: ' ', 40: ' ', 41: ' ', 42: ' ', 43: ' ', 44: ' ', 45: ' ', 46: ' ', 47: ' ', 123: ' ', 126: ' ', 58: ' ', 59: ' ', 60: ' ', 61: ' ', 62: ' ', 63: ' '} 

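The next step is redundant. Although the dictionary prints in a different order, key order does not affect dictionary equality, and the keys and values are the same. What maketrans can do is convert character keys to the ordinal values that translate requires, but that was already done when the dictionary was created. maketrans has other use cases that are not needed here, so the call can be removed.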

tbl = str.maketrans(D) 
print(tbl) 
print(D == tbl) 

Result:

{64: ' ', 60: ' ', 61: ' ', 91: ' ', 92: ' ', 93: ' ', 94: ' ', 95: ' ', 96: ' ', 33: ' ', 34: ' ', 35: ' ', 36: ' ', 37: ' ', 38: ' ', 39: ' ', 40: ' ', 41: ' ', 42: ' ', 43: ' ', 44: ' ', 45: ' ', 46: ' ', 47: ' ', 59: ' ', 62: ' ', 58: ' ', 123: ' ', 124: ' ', 125: ' ', 126: ' ', 63: ' '} 
True 

Now do the translation:

s = s.translate(tbl) 
print(s) 

Result:

life is short  stunt it   

Split it into a list of words:

print(s.split()) 

Result:

['life', 'is', 'short', 'stunt', 'it'] 
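
Putting the steps together, the whole thing fits in one small function. This is only a sketch: split_words is a hypothetical name chosen here, not something from the question.

```python
from string import punctuation

def split_words(text):
    """Replace every punctuation character with a space, then split on whitespace."""
    table = {ord(ch): " " for ch in punctuation}
    # split() with no arguments collapses runs of whitespace and drops
    # leading and trailing whitespace, so the extra spaces disappear.
    return text.translate(table).split()

print(split_words('life is short, stunt it!!?'))
# ['life', 'is', 'short', 'stunt', 'it']
```

One consequence of this approach: the apostrophe is part of string.punctuation, so a contraction such as "don't" comes back as two words, 'don' and 't'.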
0

{ord(ch):" " for ch in punctuation} is a dictionary comprehension.

Dictionary comprehensions are similar to (and based on) list comprehensions.

Blog post I wrote explaining list comprehensions
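
For comparison, here is a rough sketch that puts a list comprehension, the dictionary comprehension, and the equivalent plain loop side by side. The variable names are made up for illustration.

```python
from string import punctuation

# List comprehension: one value per punctuation character.
ordinals = [ord(ch) for ch in punctuation]

# Dictionary comprehension: one key/value pair per punctuation character.
table = {ord(ch): " " for ch in punctuation}

# The same dictionary built with an explicit loop.
table_loop = {}
for ch in punctuation:
    table_loop[ord(ch)] = " "

print(table == table_loop)                # True
print(sorted(table) == sorted(ordinals))  # True
```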

You can run this code from a Python shell to see what each line does:

>>> from string import punctuation 
>>> punctuation 
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' 
>>> punctuation_to_spaces = {ord(ch): " " for ch in punctuation} 
>>> punctuation_to_spaces 
{64: ' ', 124: ' ', 125: ' ', 91: ' ', 92: ' ', 93: ' ', 94: ' ', 95: ' ', 96: ' ', 33: ' ', 34: ' ', 35: ' ', 36: ' ', 37: ' ', 38: ' ', 39: ' ', 40: ' ', 41: ' ', 42: ' ', 43: ' ', 44: ' ', 45: ' ', 46: ' ', 47: ' ', 123: ' ', 126: ' ', 58: ' ', 59: ' ', 60: ' ', 61: ' ', 62: ' ', 63: ' '} 
>>> punctuation_removal = str.maketrans(punctuation_to_spaces) 
>>> punctuation_removal 
{64: ' ', 60: ' ', 61: ' ', 91: ' ', 92: ' ', 93: ' ', 94: ' ', 95: ' ', 96: ' ', 33: ' ', 34: ' ', 35: ' ', 36: ' ', 37: ' ', 38: ' ', 39: ' ', 40: ' ', 41: ' ', 42: ' ', 43: ' ', 44: ' ', 45: ' ', 46: ' ', 47: ' ', 59: ' ', 62: ' ', 58: ' ', 123: ' ', 124: ' ', 125: ' ', 126: ' ', 63: ' '} 
>>> s = 'life is short, stunt it!!?' 
>>> s.translate(punctuation_removal) 
'life is short  stunt it   ' 

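That dictionary comprehension line builds a dictionary with the ASCII values of the punctuation characters as keys and the space character as every value. The .translate call on our string s then uses that dictionary to convert punctuation to spaces.

The ord function converts each punctuation character to its numeric code point, which for these characters is the same as the ASCII value.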
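
A quick way to see what ord produces, and why translate wants those numbers as keys (a minimal sketch; the variable names are illustrative):

```python
# ord maps a one-character string to its integer code point; chr reverses it.
print(ord(","))  # 44
print(ord("!"))  # 33
print(chr(63))   # ?

# translate looks up each character's code point in the table,
# so a table keyed by ord(",") replaces only the comma.
s = 'life is short, stunt it!!?'
only_commas = s.translate({ord(","): " "})
print(repr(only_commas))  # 'life is short  stunt it!!?'
```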

Note that using both ord and maketrans is redundant. Either of these two solutions works just as well and avoids the double translation:

tbl = str.maketrans({ch:" " for ch in punctuation}) 
print(s.translate(tbl).split()) 

tbl = {ord(ch):" " for ch in punctuation} 
print(s.translate(tbl).split()) 