2011-07-31 20 views
3

Making a sequence of string.replace statements more readable

When I process HTML code in Python, I have to use the following code because of special characters.

line = string.replace(line, "&quot;", "\"")
line = string.replace(line, "&apos;", "'")
line = string.replace(line, "&amp;", "&")
line = string.replace(line, "&lt;", "<") 
line = string.replace(line, "&gt;", ">") 
line = string.replace(line, "&laquo;", "<<") 
line = string.replace(line, "&raquo;", ">>") 
line = string.replace(line, "&#039;", "'") 
line = string.replace(line, "&#8220;", "\"") 
line = string.replace(line, "&#8221;", "\"") 
line = string.replace(line, "&#8216;", "\'") 
line = string.replace(line, "&#8217;", "\'") 
line = string.replace(line, "&#9632;", "") 
line = string.replace(line, "&#8226;", "-") 

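It looks like there will be more special characters like these that I have to replace. Do you know how to make this code more elegant?

Thanks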

+0

Possible duplicate of [Decode HTML entities in Python string?](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string)

+1

`string.replace` and most similar functions in the `string` module are deprecated: http://docs.python.org/library/string.html#deprecated-string-functions

+0

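@Ben James Thanks, this solution works for me, but it is not a duplicate, because I may need to run a different replacement sequence (e.g. 1000 replacements driven by something other than HTML special characters) – xralf

Answers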

4
REPLACEMENTS = [ 
    ("&quot;", "\""), 
    ("&apos;", "'"), 
    ... 
    ] 
for entity, replacement in REPLACEMENTS: 
    line = line.replace(entity, replacement) 

Note that string.replace is simply available as a method on str/unicode objects.

Better still, check out this question.
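For the HTML-entity case specifically, the standard library can already do the decoding. The following is a minimal sketch, assuming Python 3.4 or later, where `html.unescape` is available; the sample string is made up for illustration.

```python
import html

# Hypothetical sample line containing named and numeric entities.
line = "A tag &lt;bidi&gt; has &quot;weird&quot; &#8220;content&#8221; &amp; more"

# html.unescape decodes named, decimal, and hexadecimal character references.
decoded = html.unescape(line)
print(decoded)
```

Unlike the replacement table in the question, this keeps typographic characters such as curly quotes as the real Unicode characters instead of mapping them to ASCII, so a custom table is still needed if that mapping matters.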

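The title of your question asks for something different, though: optimization, i.e. making it run faster. That is a completely different problem and requires more work.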

+0

You're right, I'll change the word "optimize" to "readable" – xralf

+2

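Beware that the order of the replacements always matters: when you replace `&amp;` and `&lt;`, the text `&amp;lt;` may turn into `<` if you do them in the wrong order. If the replacements follow a general pattern, you can use `re.sub` to find it and use a function to compute the replacement (for things like HTML entities).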
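The ordering pitfall from this comment can be reproduced in a few lines. This is a small sketch in Python 3; the `TABLE` dictionary and variable names are illustrative, not taken from the thread.

```python
import re

text = "&amp;lt;"  # the literal text "&lt;", escaped once

# Sequential replace with "&amp;" first: the second pass decodes a second time.
wrong = text.replace("&amp;", "&").replace("&lt;", "<")

# Sequential replace with "&amp;" last: each entity is decoded only once.
right = text.replace("&lt;", "<").replace("&amp;", "&")

# Single pass with re.sub and a replacement function: order no longer matters.
TABLE = {"&amp;": "&", "&lt;": "<", "&gt;": ">"}  # illustrative subset
pattern = re.compile("|".join(re.escape(key) for key in TABLE))
single_pass = pattern.sub(lambda m: TABLE[m.group(0)], text)

print(wrong, right, single_pass)  # prints: < &lt; &lt;
```

Because the single-pass version never rescans text it has already produced, it cannot decode the same input twice.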

2

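Here's some code I wrote a while back to decode HTML entities. Note that it is for Python 2.x, so it also decodes from str to unicode: if you are using a modern Python, you can drop that bit. I think it handles any named entity, as well as decimal and hexadecimal entities. For some reason 'apos' is not in Python's dictionary of named entities, so I first copy the dictionary and add the missing one: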

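import re
from htmlentitydefs import name2codepoint  # Python 2 module

# Work on a copy so the shared stdlib table is not modified, then add 'apos'.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# group 1: decimal reference, group 2: hexadecimal reference, group 3: entity name
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        code = match.group(2)
        if code:
            return unichr(int(code, 16))
        code = match.group(3)
        if code in name2codepoint:
            return unichr(name2codepoint[code])
        # Unknown entity: leave the text unchanged.
        return match.group(0)

    # Python 2 only: decode byte strings to unicode before substituting.
    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)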
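For readers on Python 3, the same approach might look roughly like the sketch below. It assumes `html.entities.name2codepoint` as the replacement for the Python 2 `htmlentitydefs` table and uses `chr` instead of `unichr`; the names `decode_entities`, `NAME_TO_CODEPOINT`, and `ENTITY_PATTERN` are chosen here for illustration, and the str-to-unicode step is dropped as the answer suggests.

```python
import re
from html.entities import name2codepoint

# Copy the stdlib table and add 'apos', mirroring the code above.
NAME_TO_CODEPOINT = dict(name2codepoint)
NAME_TO_CODEPOINT["apos"] = ord("'")

# group 1: decimal reference, group 2: hexadecimal reference, group 3: entity name
ENTITY_PATTERN = re.compile(r"&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z]+));")

def decode_entities(s):
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        if decimal:
            return chr(int(decimal, 10))
        if hexadecimal:
            return chr(int(hexadecimal, 16))
        if name in NAME_TO_CODEPOINT:
            return chr(NAME_TO_CODEPOINT[name])
        return match.group(0)  # unknown entity: leave it untouched
    return ENTITY_PATTERN.sub(unescape, s)

print(decode_entities("&lt;p&gt; &#39;a&#x27; &apos; &bogus;"))
```

Numeric references outside the Unicode range would make `chr` raise `ValueError`; this sketch does not guard against that.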
2

Optimization

REPL_tu = (("&quot;", "\"") , ("&apos;", "'") , ("&amp;", "&") , 
      ("&lt;", "<") , ("&gt;", ">") , 
      ("&laquo;", "<<") , ("&raquo;", ">>") , 
      ("&#039;", "'") , 
      ("&#8220;", "\"") , ("&#8221;", "\"") , 
      ("&#8216;", "\'") , ("&#8217;", "\'") , 
      ("&#9632;", "") , ("&#8226;", "-") ) 

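import re

def repl(mat, d = dict(REPL_tu)):
    # The lookup dict is built once, when the function is defined.
    return d[mat.group()]

# One alternation of all entities, so the text is scanned in a single pass.
regx = re.compile('|'.join(re.escape(a) for a,b in REPL_tu))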

line = 'A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos;' 
modline = regx.sub(repl,line) 
print 'Exemple:\n\n'+line+'\n'+modline 








from urllib import urlopen 

print '\n-----------------------------------------\nDownloading a web source:\n' 
sock = urlopen('http://www.mythicalcreaturesworld.com/greek-mythology/monsters/python-the-serpent-of-delphi-%E2%80%93-python-the-guardian-dragon-and-apollo/') 
html_source = sock.read() 
sock.close() 

from time import clock 

n = 100 

te = clock() 
for i in xrange(n): 
    res1 = html_source 
    res1 = regx.sub(repl,res1) 
print 'with regex ',clock()-te,'seconds' 


te = clock() 
for i in xrange(n): 
    res2 = html_source 
    for entity, replacement in REPL_tu: 
     res2 = res2.replace(entity, replacement) 
print 'with replace',clock()-te,'seconds' 

print res1==res2 

Result

Exemple: 

A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos; 
A tag <bidi> has a "weird"-'content' 

----------------------------------------- 
Downloading a web source: 

with regex 0.097578323502 seconds 
with replace 0.213866846205 seconds 
True
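The comparison above can be repeated without a network download. The following is a rough sketch for Python 3 using `timeit`; the synthetic `sample` string stands in for the downloaded page and the table is a subset of `REPL_tu`, both of which are assumptions made for illustration.

```python
import re
import timeit

# Subset of the replacement table used above.
REPLACEMENTS = (("&quot;", '"'), ("&apos;", "'"), ("&amp;", "&"),
                ("&lt;", "<"), ("&gt;", ">"))

TABLE = dict(REPLACEMENTS)
PATTERN = re.compile("|".join(re.escape(entity) for entity, _ in REPLACEMENTS))

# Synthetic input standing in for a real HTML page (an assumption).
sample = "A tag &lt;bidi&gt; has &quot;odd&quot; &apos;content&apos; &amp; text. " * 2000

def with_regex(text):
    # Single pass: every match is looked up in the table.
    return PATTERN.sub(lambda m: TABLE[m.group(0)], text)

def with_replace(text):
    # One full pass over the text per table entry.
    for entity, replacement in REPLACEMENTS:
        text = text.replace(entity, replacement)
    return text

print("with regex  ", timeit.timeit(lambda: with_regex(sample), number=20), "seconds")
print("with replace", timeit.timeit(lambda: with_replace(sample), number=20), "seconds")
print(with_regex(sample) == with_replace(sample))
```

Timings depend on the input, the Python version, and the hardware, so they may not match the figures above. The two approaches also only agree on inputs where no replacement produces a new entity, such as `&amp;lt;`.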