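Replacing several words in a text with Python
I use the code below to strip all HTML tags from a file and convert it to plain text. In addition, I have to convert XML/HTML character entities to ASCII characters. As written, 21 lines each scan the whole text, which means that converting a huge file costs a lot of resources.
Do you have any ideas for making the code more efficient and faster while using fewer resources?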
# -*- coding: utf-8 -*-
import re
# This file contains HTML.
with open('input-file.html', 'r') as file:
    temp = file.read()
# Replace some XML/HTML character entities with plain characters.
temp = temp.replace('&lsquo;', "'")
temp = temp.replace('&rsquo;', "'")
temp = temp.replace('&ldquo;', '"')
temp = temp.replace('&rdquo;', '"')
temp = temp.replace('&sbquo;', ",")
temp = temp.replace('&prime;', "'")
temp = temp.replace('&Prime;', '"')
temp = temp.replace('&laquo;', "«")
temp = temp.replace('&raquo;', "»")
temp = temp.replace('&lsaquo;', "‹")
temp = temp.replace('&rsaquo;', "›")
temp = temp.replace('&amp;', "&")
temp = temp.replace('&ndash;', " – ")
temp = temp.replace('&mdash;', " — ")
temp = temp.replace('&reg;', "®")
temp = temp.replace('&copy;', "©")
temp = temp.replace('&trade;', "™")
temp = temp.replace('&para;', "¶")
temp = temp.replace('&bull;', "•")
temp = temp.replace('&middot;', "·")
# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)
# Write the result to a new file.
with open("output-file.txt", "w") as file:
    file.write(result)
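
One possible direction, offered as a sketch rather than a definitive answer: fold the 20 replacements and the tag removal into a single regular expression, so the text is scanned once instead of 21 times. The names `REPLACEMENTS`, `PATTERN` and `to_plain_text` are hypothetical, and the entity list is assumed to match the one in the script above. Two behaviours differ from the script and are assumptions of this sketch: `<[^>]*>` also removes tags that span several lines, and text produced by one replacement is not scanned again.

```python
import re

# Hypothetical table mirroring the entities handled in the question.
REPLACEMENTS = {
    "&lsquo;": "'", "&rsquo;": "'", "&ldquo;": '"', "&rdquo;": '"',
    "&sbquo;": ",", "&prime;": "'", "&Prime;": '"',
    "&laquo;": "«", "&raquo;": "»", "&lsaquo;": "‹", "&rsaquo;": "›",
    "&amp;": "&", "&ndash;": " – ", "&mdash;": " — ",
    "&reg;": "®", "&copy;": "©", "&trade;": "™",
    "&para;": "¶", "&bull;": "•", "&middot;": "·",
}

# One alternation: either an HTML tag or any entity from the table.
PATTERN = re.compile(r"<[^>]*>|" + "|".join(map(re.escape, REPLACEMENTS)))


def to_plain_text(text):
    """Strip tags and replace the known entities in a single pass."""
    # Tags are not keys of the table, so they fall back to the empty string.
    return PATTERN.sub(lambda m: REPLACEMENTS.get(m.group(0), ""), text)


if __name__ == "__main__":
    print(to_plain_text("<p>&ldquo;Hi&rdquo; &amp; bye&ndash;now</p>"))
```

To apply it to the files from the question, read `input-file.html`, pass the contents to `to_plain_text`, and write the return value to `output-file.txt`. If the goal is to decode every named entity rather than a fixed list, the standard library's `html.unescape` may be worth comparing against this table-based approach.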