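Replacing several words in a text with Python

I use the code below to remove all HTML tags from a file and convert it to plain text. In addition, I have to convert XML/HTML character entities to their ASCII equivalents. Here I have 21 lines that each read through the whole text, which means that converting a huge file costs a lot of resources.

Do you have any ideas for making the code more efficient and faster while reducing resource usage?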

# -*- coding: utf-8 -*- 
import re 

# This file contains HTML. 
file = open('input-file.html', 'r') 
temp = file.read() 

# Replace Some XML/HTML characters to ASCII ones. 
temp = temp.replace ('&lsquo;',"""'""") 
temp = temp.replace ('&rsquo;',"""'""") 
temp = temp.replace ('&ldquo;',"""\"""") 
temp = temp.replace ('&rdquo;',"""\"""") 
temp = temp.replace ('&sbquo;',""",""") 
temp = temp.replace ('&prime;',"""'""") 
temp = temp.replace ('&Prime;',"""\"""") 
temp = temp.replace ('&laquo;',"""«""") 
temp = temp.replace ('&raquo;',"""»""") 
temp = temp.replace ('&lsaquo;',"""‹""") 
temp = temp.replace ('&rsaquo;',"""›""") 
temp = temp.replace ('&amp;',"""&""") 
temp = temp.replace ('&ndash;',""" – """) 
temp = temp.replace ('&mdash;',""" — """) 
temp = temp.replace ('&reg;',"""®""") 
temp = temp.replace ('&copy;',"""©""") 
temp = temp.replace ('&trade;',"""™""") 
temp = temp.replace ('&para;',"""¶""") 
temp = temp.replace ('&bull;',"""•""") 
temp = temp.replace ('&middot;',"""·""") 

# Replace HTML tags with an empty string. 
result = re.sub("<.*?>", "", temp) 
print(result) 

# Write the result to a new file. 
file = open("output-file.txt", "w") 
file.write(result) 
file.close()

2015-04-02 ANB

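The problem with string.translate() and string.maketrans() is that they force me to map one character to exactly one other character. For example:

print string.maketrans("abc", "123")

However, I need to map an HTML/XML entity such as &lsquo; to the ASCII single quote ('). That means I would have to use the following code:

print string.maketrans("&lsquo;", "'")

It fails with the following error: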

ValueError: maketrans arguments must have same length
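Since a translation table only maps single characters, one possible way to handle multi-character entities in a single scan is a compiled alternation pattern plus a dictionary lookup. The following is a minimal sketch, assuming Python 3; the names `ENTITIES`, `ENTITY_RE`, and `replace_entities` are made up for illustration, and the table holds only a subset of the entities from the question.

```python
import re

# Hypothetical subset of the entity table from the question.
ENTITIES = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&ndash;": " – ",
    "&amp;": "&",
}

# One alternation pattern built from the keys; re.escape guards any
# characters that would otherwise be special in a regular expression.
ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITIES))


def replace_entities(text):
    """Replace every known entity in a single scan of the text."""
    return ENTITY_RE.sub(lambda match: ENTITIES[match.group(0)], text)


sample = "&ldquo;Tom &amp; Jerry&rdquo;&ndash;it&rsquo;s a classic"
converted = replace_entities(sample)
print(converted)
```

Because the text is scanned once, a replacement is never rescanned, so `&amp;lsquo;` becomes the literal text `&lsquo;` instead of being decoded twice.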

However, if I use HTMLParser, it converts all HTML/XML entities without the problem above. I also added an encode('utf-8') call to fix the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)

# -*- coding: utf-8 -*- 
import re 
from HTMLParser import HTMLParser 

# This file contains HTML. 
file = open('input-file.txt', 'r') 
temp = file.read() 

# Replace all XML/HTML characters to ASCII ones. 
temp = HTMLParser.unescape.__func__(HTMLParser, temp) 

# Replace HTML tags with an empty string. 
result = re.sub("<.*?>", "", temp) 

# Encode the text to UTF-8 for preventing some errors. 
result = result.encode('utf-8') 
print(result) 

# Write the result to a new file. 
file = open("output-file.txt", "w") 
file.write(result) 
file.close()
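For readers on Python 3, where the `HTMLParser` module was renamed and `html.unescape()` is available in the standard library, the same idea might look roughly like the sketch below. The function names `html_to_text` and `convert_file` are hypothetical, and UTF-8 input and output are assumed. Note two deliberate differences from the code above: tags are stripped before entities are decoded, so escaped text such as `&lt;b&gt;` survives as literal text, and `re.DOTALL` lets a tag span several lines.

```python
import html
import re

# Same non-greedy tag pattern as above; re.DOTALL is an added assumption
# so that a tag broken across lines is still matched.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def html_to_text(markup):
    """Strip tags, then decode entities to Unicode characters."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


def convert_file(src_path, dst_path):
    """Read an HTML file and write its plain text (UTF-8 assumed)."""
    with open(src_path, "r", encoding="utf-8") as src:
        text = html_to_text(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


print(html_to_text("<p>&ldquo;Hi&rdquo; &amp; <b>bye</b></p>"))
```

Opening the output file with an explicit encoding makes a separate `encode('utf-8')` step unnecessary in this version.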

2015-04-02 08:16:03 ANB

You can use string.translate():

from string import maketrans  # Python 2 only; required to call maketrans()

# intab and outtab must have the same length: each character in intab
# is replaced by the character at the same position in outtab.
intab = "aeiou"   # example: original characters that need to be replaced
outtab = "12345"  # example: new characters
trantab = maketrans(intab, outtab)  # helper in the string module that builds a translation table

text = "this is string example....wow!!!"  # your string
print text.translate(trantab)

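Note that in Python 3, str.translate can be significantly slower than in Python 2, especially if you translate only a few characters. This is because it has to handle Unicode characters and therefore uses a dictionary to perform the translation instead of indexing into a string.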
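To make the dictionary-based table concrete, here is a small sketch, assuming Python 3. `str.maketrans()` accepts a dict whose keys are single characters; the values may be strings of any length or `None` to delete the character. The particular mapping is only an illustration, and it works on characters that have already been decoded, not on multi-character entities such as `&lsquo;`.

```python
# Python 3: the table built by str.maketrans is a plain dict keyed by
# Unicode code points.
table = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u00b6": None,  # pilcrow sign, deleted
})

text = "\u201cIt\u2019s 10\u201320\u201d\u00b6"
ascii_text = text.translate(table)
print(ascii_text)
```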

2015-04-02 07:03:22

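My first instinct is a combination of string.translate() and string.maketrans(), which makes a single pass instead of several. Every call to str.replace() makes its own pass over the entire string, and that is what you want to avoid.

An example: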

from string import ascii_lowercase, maketrans, translate 

from_str = ascii_lowercase 
to_str = from_str[-1]+from_str[0:-1] 
foo = 'the quick brown fox jumps over the lazy dog.' 
bar = translate(foo, maketrans(from_str, to_str)) 
print bar # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
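The example above is Python 2 code; `string.maketrans` and `string.translate` are not available in Python 3. A rough Python 3 equivalent, using the `str.maketrans()` static method and the `str.translate()` method, could look like this:

```python
from string import ascii_lowercase

# Shift every lowercase letter back by one position (a -> z, b -> a, ...).
from_str = ascii_lowercase
to_str = from_str[-1] + from_str[:-1]
table = str.maketrans(from_str, to_str)

foo = "the quick brown fox jumps over the lazy dog."
bar = foo.translate(table)  # one pass over the string
print(bar)
```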

2015-04-02 07:11:22 Shashank
