2017-03-02 81 views
0

Removing punctuation from text in Python

I want to get the tokens (words) from a text file and strip all punctuation from them. I tried the following:

import re 

with open('hw.txt') as f: 
    lines_after_254 = f.readlines()[254:] 
    sent = [word for line in lines_after_254 for word in line.lower().split()] 
    words = re.sub('[!#?,.:";]', '', sent) 

I get the following error:

return _compile(pattern, flags).sub(repl, string, count) 
TypeError: expected string or buffer 

Answers

1

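There are a couple of problems in the script. You split everything into a list of tokens and only then try to remove the special characters, so re.sub is called on a list rather than a string.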

A better approach is to read the input as a string, remove the special characters, and then tokenize it.

import re

# open the input text file and read its contents
with open('hw.txt') as f:
    text = f.read()
print(text)

# remove the special characters from the text
no_specials_string = re.sub('[!#?,.:";]', '', text)
print(no_specials_string)

# split the text and store the words in a list
words = no_specials_string.split()
print(words)

Alternatively, if you want to split the text into tokens first and then remove the special characters, you can do this:

import re

# open the input text file and read its contents
with open('hw.txt') as f:
    text = f.read()
print(text)

# split the text and store the words in a list
words = text.split()
print(words)

# remove the special characters from each word in words
new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
print(new_words)
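
Either order can be wrapped in a small helper. The following is only a sketch: the function name tokenize and its skip_lines parameter are illustrative names that do not appear in the answer, and it assumes, as in the question, that some leading lines should be skipped and the text lowercased.

```python
import re
from itertools import islice

PUNCT = re.compile(r'[!#?,.:";]')  # same character set as in the answer above

def tokenize(lines, skip_lines=0):
    """Return lowercase tokens from an iterable of lines, punctuation removed.

    tokenize and skip_lines are hypothetical names used for illustration.
    """
    tokens = []
    for line in islice(lines, skip_lines, None):
        # strip punctuation from the whole line first, then split on whitespace
        tokens.extend(PUNCT.sub('', line.lower()).split())
    return tokens

# With a real file this would be:
#     with open('hw.txt') as f:
#         words = tokenize(f, skip_lines=254)
sample = ['Header line.\n', 'Hello, world!\n', 'Is this "it"?\n']
print(tokenize(sample, skip_lines=1))  # ['hello', 'world', 'is', 'this', 'it']
```

Passing the open file object instead of f.readlines() avoids loading the whole file into memory before slicing.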
1

Nothing is being read into your list:

In [14]: with open('data', 'r') as f: 
    ...:  l=f.readlines()[254:] 
    ...:  

In [15]: l 
Out[15]: [] 

Assuming you want a list of words, try this:

import re

with open('data', 'r') as f:
    lines = [line.strip() for line in f]

# first 254 lines; use lines[254:] for the lines after that, as in the question
sent = [w for line in lines[:254] for w in re.split(r'\s+', line)]

find = '[!#?,.:";]'
replace = ''

words = [re.sub(find, replace, word) for word in sent]

As @Keerthana Prabhakaran pointed out, the re.sub call has been corrected.
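
One detail the snippet above leaves open: a token made up only of the listed characters (for example "...") becomes an empty string after re.sub, and re.split on an empty line also yields an empty string. A minimal sketch of filtering those out, using a hypothetical helper name clean_tokens that is not part of the answer:

```python
import re

FIND = '[!#?,.:";]'

def clean_tokens(tokens):
    """Strip the listed punctuation from each token and drop empty results.

    clean_tokens is a hypothetical helper used for illustration.
    """
    stripped = (re.sub(FIND, '', tok) for tok in tokens)
    return [tok for tok in stripped if tok]

sent = re.split(r'\s+', 'Wait ... what?! ok.')
print(clean_tokens(sent))  # ['Wait', 'what', 'ok']
```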

+1

This still retains the error!

+2

The error is 'return _compile(pattern, flags).sub(repl, string, count)', and here 'sent' is a list!

1

re.sub is applied to a string, not a list!

print(re.sub(pattern, '', sent))

should be

print([re.sub(pattern, '', s) for s in sent])
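
To see the difference in isolation, here is a small sketch with a made-up two-word list standing in for 'sent'. The exact wording of the TypeError message varies between Python versions, so it is printed rather than assumed.

```python
import re

pattern = '[!#?,.:";]'
sent = ['hello,', 'world!']  # sample data, an assumption for this example

try:
    re.sub(pattern, '', sent)  # sent is a list, so this raises TypeError
except TypeError as exc:
    print('TypeError:', exc)

# applying re.sub to each string in the list works
cleaned = [re.sub(pattern, '', s) for s in sent]
print(cleaned)  # ['hello', 'world']
```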

Hope this helps!

1

Use the following:

import string 
translator = str.maketrans('', '', string.punctuation) 
def remove_puncts(input_string): 
    return input_string.translate(translator) 

Usage example for the remove_puncts() function:

input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`@#$%svbrxs"""
remove_puncts(input_string) 
'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs' 
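
Note that string.punctuation covers every ASCII punctuation character, including apostrophes and hyphens, so "don't" would become "dont". If that is not wanted, one option is to build the translation table from a reduced set. A sketch, where the names KEEP and remove_puncts_keeping are made up for illustration:

```python
import string

KEEP = "'-"  # characters to leave in place; an assumption for this example
to_remove = ''.join(ch for ch in string.punctuation if ch not in KEEP)
translator = str.maketrans('', '', to_remove)

def remove_puncts_keeping(input_string):
    """Remove ASCII punctuation except the characters listed in KEEP."""
    return input_string.translate(translator)

print(remove_puncts_keeping("Don't stop, well-known (really)!"))
# Don't stop well-known really
```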

Edit

Speed comparison

It turns out that the translator method is faster than the regular expression:

import re, string
from time import time

pattern = '[!#?,.:";]' 
def regex_sub(input_string): 
    return re.sub(pattern, '', input_string) 

translator = str.maketrans('', '', string.punctuation) 
def string_translator(input_string): 
    return input_string.translate(translator) 

input_string = """cwsx#?;.frvcdr""" 
string_translator(input_string) 
regex_sub(input_string) 

passes = 1000000 
t1 = time() 
for i in range(passes): 
    a = string_translator(input_string) 

t2 = time() 
for i in range(passes): 
    a = regex_sub(input_string) 

t3 = time() 

string_translator_time = t2 - t1 
regex_sub_time = t3 - t2 

print(string_translator_time) # 1.341651439666748 
print(regex_sub_time) # 3.44773268699646
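
The loop above times two functions that remove different character sets (all of string.punctuation versus the few characters in the pattern). A sketch of a more like-for-like comparison using timeit, with a precompiled pattern and the same characters in both functions. The name CHARS is illustrative, and no timings are claimed here, since they depend on the machine and the Python version.

```python
import re
import timeit

CHARS = '!#?,.:";'  # the same characters for both approaches
compiled = re.compile('[' + re.escape(CHARS) + ']')
translator = str.maketrans('', '', CHARS)

def regex_sub(input_string):
    return compiled.sub('', input_string)

def string_translator(input_string):
    return input_string.translate(translator)

input_string = 'cwsx#?;.frvcdr'
# both approaches should give the same result before timing them
assert regex_sub(input_string) == string_translator(input_string) == 'cwsxfrvcdr'

passes = 100000
print('translate:', timeit.timeit(lambda: string_translator(input_string), number=passes))
print('regex:    ', timeit.timeit(lambda: regex_sub(input_string), number=passes))
```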