2015-04-02 122 views
3

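Replacing several words in a text with Python

I use the code below to strip all HTML tags from a file and convert it to plain text. In addition, I have to convert XML/HTML character entities to ASCII characters. As written, the code has 21 lines that each scan the full text, which means that converting a huge file takes a lot of resources.

Do you have any ideas for making the code more efficient and faster while also using fewer resources?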

# -*- coding: utf-8 -*-
import re

# This file contains HTML.
with open('input-file.html', 'r') as infile:
    temp = infile.read()

# Replace some XML/HTML character entities with ASCII ones.
temp = temp.replace('&lsquo;', "'")
temp = temp.replace('&rsquo;', "'")
temp = temp.replace('&ldquo;', '"')
temp = temp.replace('&rdquo;', '"')
temp = temp.replace('&sbquo;', ",")
temp = temp.replace('&prime;', "'")
temp = temp.replace('&Prime;', '"')
temp = temp.replace('&laquo;', "«")
temp = temp.replace('&raquo;', "»")
temp = temp.replace('&lsaquo;', "‹")
temp = temp.replace('&rsaquo;', "›")
temp = temp.replace('&amp;', "&")
temp = temp.replace('&ndash;', " – ")
temp = temp.replace('&mdash;', " — ")
temp = temp.replace('&reg;', "®")
temp = temp.replace('&copy;', "©")
temp = temp.replace('&trade;', "™")
temp = temp.replace('&para;', "¶")
temp = temp.replace('&bull;', "•")
temp = temp.replace('&middot;', "·")

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)

# Write the result to a new file.
with open("output-file.txt", "w") as outfile:
    outfile.write(result)

Answers

0

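The problem with string.translate() and string.maketrans() is that they can only map one single character to another single character. For example:

import string
print string.maketrans("abc", "123")

However, I need to map an HTML/XML entity such as &lsquo; to the ASCII single quote ('). That means I would have to use the following code:

print string.maketrans("&lsquo;", "'")

which fails with the following error: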

ValueError: maketrans arguments must have same length

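However, if I use HTMLParser, it converts all HTML/XML entities to the characters they represent without the problem above. I also added encode('utf-8') to fix the following error: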

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)

# -*- coding: utf-8 -*-
import re
from HTMLParser import HTMLParser  # Python 2 module

# This file contains HTML.
with open('input-file.txt', 'r') as infile:
    temp = infile.read()

# Convert all XML/HTML entities to the characters they represent.
temp = HTMLParser().unescape(temp)

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)

# Encode the text as UTF-8 to avoid the UnicodeEncodeError shown above.
result = result.encode('utf-8')
print(result)

# Write the result to a new file.
with open("output-file.txt", "w") as outfile:
    outfile.write(result)
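
On Python 3 the same idea can be sketched with the standard-library html.unescape function (available since Python 3.4). The helper name html_to_text and the sample string are made up for illustration, and the regular expression is the same naive tag pattern used above, so treat this as a sketch under those assumptions rather than a robust HTML parser.

```python
import html
import re

# Same naive tag pattern as above, compiled once. re.DOTALL is an added
# assumption so that a tag spanning several lines is also removed.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def html_to_text(markup):
    """Strip tags, then decode entities (hypothetical helper)."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


sample = "<p>&ldquo;Hello&rdquo; &amp; <b>goodbye</b></p>"
print(html_to_text(sample))
```

Note that this sketch strips tags before decoding entities, which is the reverse of the order in the code above. That way text that was escaped on purpose, such as &lt;b&gt;, survives as literal text instead of being removed as a tag.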
1

You can use string.translate():

from string import maketrans  # Python 2; required to call the maketrans function.

intab = "aeiou"   # original characters that need to be replaced (example values)
outtab = "12345"  # new characters; must be the same length as intab
trantab = maketrans(intab, outtab)  # helper in the string module that creates a translation table

text = "this is string example....wow!!!"  # your string
print text.translate(trantab)

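Note that str.translate in Python 3 is significantly slower than in Python 2, especially if you only translate a few characters. This is because it has to handle Unicode characters and therefore uses a dictionary to perform the translation instead of indexing into a string.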
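
To illustrate the dictionary-based table, here is a minimal Python 3 sketch. str.maketrans accepts a dict whose keys are single characters and whose values may be strings of any length, so a typographic character can be mapped to an ASCII replacement. The particular mapping and sample text are illustrative assumptions.

```python
# Keys must be single characters; values may be strings of any length.
table = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": " - ",  # en dash, replaced by a spaced hyphen
})

text = "\u201cIt\u2019s done\u201d"
ascii_text = text.translate(table)
print(ascii_text)  # "It's done"
```

The keys are still single characters, so this does not by itself match a multi-character entity such as &lsquo; in the raw HTML.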

1

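My first instinct is a combination of string.translate() and string.maketrans(), which makes only one pass over the string instead of several. Each call to str.replace() makes its own pass over the entire string, and you want to avoid that.

An example: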

from string import ascii_lowercase, maketrans, translate 

from_str = ascii_lowercase 
to_str = from_str[-1]+from_str[0:-1] 
foo = 'the quick brown fox jumps over the lazy dog.' 
bar = translate(foo, maketrans(from_str, to_str)) 
print bar # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
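
Because maketrans only accepts single-character keys, one way to keep the single-pass idea for multi-character entities such as &lsquo; is a compiled alternation combined with a lookup dictionary. This is a different technique from the translate example above, sketched here for Python 3. The contents of ENTITY_MAP and the function name replace_entities are illustrative assumptions.

```python
import re

# Hypothetical subset of the entities from the question.
ENTITY_MAP = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
}

# One alternation that matches any key; re.escape guards special characters.
ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITY_MAP))


def replace_entities(text):
    """Replace every mapped entity in a single pass over text."""
    return ENTITY_RE.sub(lambda match: ENTITY_MAP[match.group(0)], text)


print(replace_entities("&ldquo;It&rsquo;s &amp; done&rdquo;"))
```

A side effect of the single pass is that replaced text is never scanned again, so an input like &amp;lsquo; becomes &lsquo; rather than being decoded twice, which chained replace() calls can do depending on their order.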