Make a sequence of string.replace statements more readable

When I process HTML code in Python, I have to use the following code because of special characters.

line = string.replace(line, "&quot;", "\"") 
line = string.replace(line, "&apos;", "'") 
line = string.replace(line, "&amp;", "&") 
line = string.replace(line, "&lt;", "<") 
line = string.replace(line, "&gt;", ">") 
line = string.replace(line, "&laquo;", "<<") 
line = string.replace(line, "&raquo;", ">>") 
line = string.replace(line, "&#039;", "'") 
line = string.replace(line, "&#8220;", "\"") 
line = string.replace(line, "&#8221;", "\"") 
line = string.replace(line, "&#8216;", "\'") 
line = string.replace(line, "&#8217;", "\'") 
line = string.replace(line, "&#9632;", "") 
line = string.replace(line, "&#8226;", "-")

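It looks like there will be more such special characters that I have to replace. Do you know how to make this code more elegant?

Thanks

Source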

2011-07-31 xralf

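Possible duplicate of [Decode HTML entities in Python string?](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string)

'string.replace' and most similar functions in the 'string' module are deprecated: http://docs.python.org/library/string.html#deprecated-string-functions

@Ben James Thanks, this solution works for me, but it is not a duplicate, because I may need to apply another sequence of replacements (for example, 1000 replacements based on something other than HTML special characters) – xralf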

REPLACEMENTS = [ 
    ("&quot;", "\""), 
    ("&apos;", "'"), 
    ... 
    ] 
for entity, replacement in REPLACEMENTS: 
    line = line.replace(entity, replacement)

Note that string.replace is simply available as a method on str/unicode objects.
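As a minimal sketch of the same idea on Python 3, the table and the loop can be wrapped in a small function. The name `replace_all` and the sample string are made up for illustration; the table only reuses a few pairs from the question.

```python
# Hypothetical helper, not from the original answer: applies each
# (old, new) pair in order using the str.replace method.
REPLACEMENTS = [
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&#8226;", "-"),
]


def replace_all(text, replacements=REPLACEMENTS):
    """Return text with every pair in replacements applied sequentially."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text


sample = "&lt;b&gt; &quot;hi&quot; &#8226; &apos;x&apos;"
result = replace_all(sample)
print(result)
```

Keeping the pairs in a list rather than a dict makes the order of application explicit, which matters once one replacement can produce text that another one matches.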

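Better yet, check out this question!

The title of your question asks something different, though: optimization, i.e. making it run faster. That is a completely different problem and takes more work.

Source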

2011-07-31 11:32:38 Thomas

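You're right, I'll change the word "optimize" to "readable" – xralf

Beware that the order of replacements always matters: if you replace '&amp;' and '&lt;' in the wrong order, '&amp;lt;' can end up as '<' instead of '&lt;'. If you have a general pattern for the things to replace, you can use 're.sub' to find them and a function to compute each replacement (for things like HTML entities).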
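The ordering problem described in this comment can be reproduced in a few lines. The sketch below is an illustration on Python 3, not code from the thread: `PAIRS`, `sequential` and `single_pass` are hypothetical names, and the table deliberately puts "&amp;" first.

```python
import re

# Illustrative table: "&amp;" deliberately comes first to show the problem.
PAIRS = [("&amp;", "&"), ("&lt;", "<"), ("&gt;", ">")]


def sequential(text):
    """Apply each replacement in turn; earlier output can be re-matched."""
    for old, new in PAIRS:
        text = text.replace(old, new)
    return text


# One alternation pattern; re.escape guards against regex metacharacters.
PATTERN = re.compile("|".join(re.escape(old) for old, _ in PAIRS))
LOOKUP = dict(PAIRS)


def single_pass(text):
    """Replace every match once, never rescanning replaced text."""
    return PATTERN.sub(lambda m: LOOKUP[m.group(0)], text)


escaped = "&amp;lt;"  # an escaped "&lt;", which should decode to the text "&lt;"
print(sequential(escaped))   # the "&" produced first is consumed by the "&lt;" rule
print(single_pass(escaped))
```

With the sequential loop the input collapses to "<", while the single regex pass leaves "&lt;", because `re.sub` continues scanning after each match instead of rereading the replacement.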

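Here is some code I wrote a while back to decode HTML entities. Note that it is for Python 2.x, so it also decodes from str to unicode: if you are using a modern Python you can drop that bit. I think it handles any named entities as well as decimal and hex entities. For some reason 'apos' is not in Python's dictionary of named entities, so I copy the dictionary first and add the missing one: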

import re
from htmlentitydefs import name2codepoint  # Python 2 module name

# Work on a copy so the standard table is left untouched, then add 'apos'.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# Group 1: decimal (&#39;), group 2: hex (&#x27;), group 3: named (&amp;).
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        code = match.group(2)
        if code:
            return unichr(int(code, 16))
        code = match.group(3)
        if code in name2codepoint:
            return unichr(name2codepoint[code])
        # Unknown entity: leave the text unchanged.
        return match.group(0)

    # Python 2 only: decode byte strings to unicode first.
    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)
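On Python 3.4 or later, the standard library's `html.unescape` covers the same ground (named, decimal and hexadecimal entities, including 'apos'), so a hand-written decoder may not be needed. The sample string below is invented for illustration.

```python
import html

# html.unescape is part of the standard library from Python 3.4 on.
sample = "&lt;p&gt; &quot;x&quot; &apos;y&apos; &#8226; &#x2022; &amp;lt;"
decoded = html.unescape(sample)
print(decoded)
```

Unlike the replacement table in the question, this maps each entity to its actual Unicode character (for example, `&#8226;` becomes a bullet rather than "-"), so any ASCII substitutions would still need a separate step.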

Source

2011-07-31 12:14:19 Duncan

Optimization

REPL_tu = (("&quot;", "\"") , ("&apos;", "'") , ("&amp;", "&") , 
      ("&lt;", "<") , ("&gt;", ">") , 
      ("&laquo;", "<<") , ("&raquo;", ">>") , 
      ("&#039;", "'") , 
      ("&#8220;", "\"") , ("&#8221;", "\"") , 
      ("&#8216;", "\'") , ("&#8217;", "\'") , 
      ("&#9632;", "") , ("&#8226;", "-") ) 

def repl(mat, d = dict(REPL_tu)): 
    return d[mat.group()] 

import re 
regx = re.compile('|'.join(a for a,b in REPL_tu)) 

line = 'A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos;' 
modline = regx.sub(repl,line) 
print 'Exemple:\n\n'+line+'\n'+modline

from urllib import urlopen

print '\n-----------------------------------------\nDownloading a web source:\n' 
sock = urlopen('http://www.mythicalcreaturesworld.com/greek-mythology/monsters/python-the-serpent-of-delphi-%E2%80%93-python-the-guardian-dragon-and-apollo/') 
html_source = sock.read() 
sock.close() 

from time import clock 

n = 100 

te = clock() 
for i in xrange(n): 
    res1 = html_source 
    res1 = regx.sub(repl,res1) 
print 'with regex ',clock()-te,'seconds' 


te = clock() 
for i in xrange(n): 
    res2 = html_source 
    for entity, replacement in REPL_tu: 
        res2 = res2.replace(entity, replacement)
print 'with replace',clock()-te,'seconds' 

print res1==res2

Result

Exemple: 

A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos; 
A tag <bidi> has a "weird"-'content' 

----------------------------------------- 
Downloading a web source: 

with regex 0.097578323502 seconds 
with replace 0.213866846205 seconds 
True
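For readers on Python 3, where `print` is a function and `time.clock` and `xrange` no longer exist, a rough equivalent of this comparison might look as follows. This is a sketch under assumptions: it times a synthetic string instead of the downloaded page, uses a shortened table, and the names `with_regex` and `with_replace` are made up here. Timings will vary by machine and input, so no figures are claimed.

```python
import re
import timeit

REPL = (
    ("&quot;", '"'), ("&apos;", "'"), ("&amp;", "&"),
    ("&lt;", "<"), ("&gt;", ">"),
    ("&#8220;", '"'), ("&#8226;", "-"),
)
LOOKUP = dict(REPL)
# re.escape keeps the alternation safe if a key ever contains a metacharacter.
REGX = re.compile("|".join(re.escape(old) for old, _ in REPL))


def with_regex(text):
    """Single pass: look up each matched entity in the dict."""
    return REGX.sub(lambda m: LOOKUP[m.group(0)], text)


def with_replace(text):
    """One str.replace call per table entry."""
    for old, new in REPL:
        text = text.replace(old, new)
    return text


# Synthetic input standing in for the downloaded page (an assumption).
line = "A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos; "
text = line * 2000

for func in (with_regex, with_replace):
    seconds = timeit.timeit(lambda: func(text), number=20)
    print(func.__name__, seconds, "seconds")

# The two approaches agree on this input, which has no nested "&amp;lt;".
print(with_regex(text) == with_replace(text))
```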

Source

2011-07-31 13:59:43 eyquem
