Removing punctuation from text in Python

I want to get the tokens (words) from a text file and strip all punctuation from them. I tried the following:

import re 

with open('hw.txt') as f: 
    lines_after_254 = f.readlines()[254:] 
    sent = [word for line in lines_after_254 for word in line.lower().split()] 
    words = re.sub('[!#?,.:";]', '', sent)

I get the following error:

return _compile(pattern, flags).sub(repl, string, count) 
TypeError: expected string or buffer


2017-03-02 J Doe

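There are a couple of problems in the script. re.sub expects a single string, but sent is a list of words, which is why the TypeError is raised. You are also trying to remove the special characters only after everything has already been split up.

A better approach is to read the input as one string, remove the special characters, and then tokenize the result.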

import re

# open the input text file and read its contents
with open('hw.txt') as f:
    text = f.read()
print(text)

# remove the special characters from the text
no_specials_text = re.sub(r'[!#?,.:";]', '', text)
print(no_specials_text)

# split the text and store the words in a list
words = no_specials_text.split()
print(words)

Alternatively, if you want to split the text into tokens first and remove the special characters afterwards, you can do this:

import re

# open the input text file and read its contents
with open('hw.txt') as f:
    text = f.read()
print(text)

# split the text and store the words in a list
words = text.split()
print(words)

# remove the special characters from each word in words
new_words = [re.sub(r'[!#?,.:";]', '', word) for word in words]
print(new_words)
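The two orderings above are not quite equivalent: when a punctuation mark stands alone between spaces, splitting first leaves an empty string behind. The following is a minimal sketch of that difference; the function names and the sample sentence are made up for illustration, and the character class is the one used above.

```python
import re

PUNCT = r'[!#?,.:";]'  # same character class as in the answer above

def strip_then_split(text):
    """Remove punctuation from the whole string, then split on whitespace."""
    return re.sub(PUNCT, '', text).split()

def split_then_strip(text):
    """Split on whitespace first, then remove punctuation from each token."""
    tokens = [re.sub(PUNCT, '', tok) for tok in text.split()]
    # A token made only of punctuation becomes '', so drop the empty results.
    return [tok for tok in tokens if tok]

sample = 'Hello, world ! Is this: "fine" ?'
print(strip_then_split(sample))  # ['Hello', 'world', 'Is', 'this', 'fine']
print(split_then_strip(sample))  # ['Hello', 'world', 'Is', 'this', 'fine']
```

Without the final filter in split_then_strip, the standalone '!' and '?' would show up as empty strings in the list.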


2017-03-02 04:50:05

Nothing is being read into your list. If the file has 254 lines or fewer, the slice [254:] is empty:

In [14]: with open('data', 'r') as f: 
    ...:  l=f.readlines()[254:] 
    ...:  

In [15]: l 
Out[15]: []

Assuming you want a list of words, try this:

import re

with open('data', 'r') as f:
    lines = [line.strip() for line in f]

# split each of the first 254 lines on whitespace
sent = [w for line in lines[:254] for w in re.split(r'\s+', line)]

find = r'[!#?,.:";]'
replace = ''

words = [re.sub(find, replace, word) for word in sent]

As @Keerthana Prabhakaran pointed out, the re.sub call has been corrected.
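If the goal is the question's original one (skip the first 254 lines, then tokenize), one possible sketch uses itertools.islice, which also copes with files that are shorter than expected. The function name words_after and its parameters are hypothetical, chosen only for this example.

```python
import re
from itertools import islice

def words_after(path, skip=254, pattern=r'[!#?,.:";]'):
    """Return lower-cased words from `path`, ignoring the first `skip` lines."""
    words = []
    with open(path) as f:
        # islice skips the first `skip` lines lazily;
        # a shorter file simply yields nothing.
        for line in islice(f, skip, None):
            for token in line.lower().split():
                cleaned = re.sub(pattern, '', token)
                if cleaned:  # ignore tokens that were only punctuation
                    words.append(cleaned)
    return words
```

Calling words_after('hw.txt') would then return an empty list, rather than fail, when the file has no lines past the first 254.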


2017-03-02 04:14:25 aydow

This still retains the error!

The error is raised at 'return _compile(pattern, flags).sub(repl, string, count)' because 'sent' is a list here!

re.sub operates on a string, not on a list!

print(re.sub(pattern, '', sent))

should be

print([re.sub(pattern, '', s) for s in sent])

Hope this helps!
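For anyone who wants to reproduce the problem in isolation, here is a small sketch. The two-word list is made up for the example, and the exact wording of the exception message differs between Python versions.

```python
import re

pattern = r'[!#?,.:";]'
sent = ['hello,', 'world!']  # illustrative data, not from the original file

try:
    re.sub(pattern, '', sent)  # passing the whole list raises TypeError
except TypeError as exc:
    print('TypeError:', exc)

# Applying re.sub to each string in the list works as intended.
cleaned = [re.sub(pattern, '', s) for s in sent]
print(cleaned)  # ['hello', 'world']
```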


2017-03-02 04:26:53

Use the following:

import string 
translator = str.maketrans('', '', string.punctuation) 
def remove_puncts(input_string): 
    return input_string.translate(translator)

Example usage of the remove_puncts() function:

input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`[email protected]#$%svbrxs""" 
remove_puncts(input_string) 
'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs'
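Note that string.punctuation covers every ASCII punctuation character, which is more than the question's regex removes. If only the question's characters should go, a table can be built from just those. This is a sketch; the names question_chars and remove_question_puncts are invented for the example.

```python
# Only the characters from the question's pattern '[!#?,.:";]'
question_chars = '!#?,.:";'

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', question_chars)

def remove_question_puncts(text):
    """Delete only the characters listed in question_chars."""
    return text.translate(table)

print(remove_question_puncts('Hello, "world"! #1: a-b'))  # Hello world 1 a-b
```

The hyphen in 'a-b' survives here, whereas a table built from string.punctuation would remove it as well.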


Edit

Speed comparison

It turns out that using the translator method is faster than using regular expressions:

import re, string
from time import time  # time() returns the current time in seconds

pattern = '[!#?,.:";]'
def regex_sub(input_string):
    return re.sub(pattern, '', input_string)

translator = str.maketrans('', '', string.punctuation)
def string_translator(input_string):
    return input_string.translate(translator)

input_string = """cwsx#?;.frvcdr"""
string_translator(input_string)
regex_sub(input_string)

passes = 1000000
t1 = time()
for i in range(passes):
    a = string_translator(input_string)

t2 = time()
for i in range(passes):
    a = regex_sub(input_string)

t3 = time()

string_translator_time = t2 - t1
regex_sub_time = t3 - t2

print(string_translator_time) # 1.341651439666748
print(regex_sub_time) # 3.44773268699646
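The same comparison can also be sketched with the standard timeit module. Two assumptions differ from the measurement above: the regex is precompiled, and it is built from string.punctuation so that both functions remove exactly the same characters. The name translate_sub is made up for this example, and no timings are quoted because they depend on the machine and Python version.

```python
import re
import string
import timeit

# Assumption: the same character set for both approaches, regex precompiled.
compiled = re.compile('[' + re.escape(string.punctuation) + ']')
table = str.maketrans('', '', string.punctuation)

def regex_sub(text):
    return compiled.sub('', text)

def translate_sub(text):
    return text.translate(table)

sample = 'cwsx#?;.frvcdr'
# Both approaches must agree before their speed is compared.
assert regex_sub(sample) == translate_sub(sample) == 'cwsxfrvcdr'

for fn in (translate_sub, regex_sub):
    seconds = timeit.timeit(lambda: fn(sample), number=100_000)
    print(f'{fn.__name__}: {seconds:.3f} s')
```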


2017-03-02 08:06:43
