2012-08-15 30 views
2

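'list' object has no attribute 'items' in Python's BeautifulSoup renderContents

In an attempt to strip unwanted/unsafe tags and attributes from input, I'm using the following code (almost entirely taken from http://djangosnippets.org/snippets/1655/):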

def html_filter(value, allowed_tags = 'p h1 h2 h3 div span a:href:title img:src:alt:title table:cellspacing:cellpadding th tr td:colspan:rowspan ol ul li br'):
    js_regex = re.compile(r'[\s]*(&#x.{1,7})?'.join(list('javascript')))
    allowed_tags = [tag.split(':') for tag in allowed_tags.split()]
    allowed_tags = dict((tag[0], tag[1:]) for tag in allowed_tags)
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in allowed_tags:
            tag.hidden = True
        else:
            tag.attrs = [(attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name]]
    return soup.renderContents().decode('utf8')

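It works fine for unwanted or whitelisted tags, non-whitelisted attributes, and even badly formed HTML. However, if any whitelisted attribute is present, it raises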

'list' object has no attribute 'items' 

on the last line, which doesn't help me much. type(soup) is <class 'bs4.BeautifulSoup'> whether or not the error is raised, so I don't know what it is referring to.

Traceback: 
[...] 
File "C:\Users\Mark\Web\www\fnwidjango\src\base\functions\html_filter.py" in html_filter 
    30.  return soup.renderContents().decode('utf8') 
File "C:\Python27\lib\site-packages\bs4\element.py" in renderContents 
    1098.    indent_level=indentLevel, encoding=encoding) 
File "C:\Python27\lib\site-packages\bs4\element.py" in encode_contents 
    1089.   contents = self.decode_contents(indent_level, encoding, formatter) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents 
    1074.         formatter)) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode 
    1021.    indent_contents, eventual_encoding, formatter) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents 
    1074.         formatter)) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode 
    1021.    indent_contents, eventual_encoding, formatter) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents 
    1074.         formatter)) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode 
    1021.    indent_contents, eventual_encoding, formatter) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents 
    1074.         formatter)) 
File "C:\Python27\lib\site-packages\bs4\element.py" in decode 
    983.    for key, val in sorted(self.attrs.items()): 

Exception Type: AttributeError at /"nieuws"/article/3-test/ 
Exception Value: 'list' object has no attribute 'items' 
+0

Are you sure 'tag.attrs' shouldn't be a dictionary? (It starts out as one, and you change it into a list.) – mgilson 2012-08-15 23:58:26

+0

I'm trying to remember why I changed that... So far nothing seems to break when I change it back, so maybe my brain just malfunctioned. – Mark 2012-08-16 00:46:45

Answers

3

Try:

tag.attrs = dict((attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name]) 
1

Replace this line:

tag.attrs = [(attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name]] 

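It looks like renderContents() expects attrs to be a dict (which has an items method), not the list of tuples you are assigning. That is why it throws an AttributeError when it tries to call attrs.items().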
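The failure does not depend on BeautifulSoup itself, only on the type stored in attrs, so it can be reproduced in plain Python. The following is a minimal sketch; attrs_as_list and attrs_as_dict are made-up stand-ins for tag.attrs, not names from bs4.

```python
# Hypothetical stand-ins for tag.attrs; BeautifulSoup is not needed here.
attrs_as_list = [("href", "http://example.com"), ("title", "Example")]
attrs_as_dict = dict(attrs_as_list)

# The traceback shows bs4 iterating over sorted(self.attrs.items()).
# With a dict this works:
pairs = sorted(attrs_as_dict.items())

# With a list of tuples the same call fails, because lists have no items().
try:
    sorted(attrs_as_list.items())
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(pairs)
print(error_message)
```

The error only surfaces at render time because assigning the list to attrs is not checked; nothing calls items() until the tree is serialized.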

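To fix this error, you can use a dict comprehension (available in Python 2.7 and later, including Python 3):

tag.attrs = {attr: js_regex.sub('', val) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name]}

In Python 2.6 and earlier, dict comprehensions are not supported, so you should pass a generator expression to the dict constructor instead:

tag.attrs = dict((attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name])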
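To check the filtering logic in isolation, the same expression can be run against an ordinary dict. This is only a sketch: filter_attrs and the sample raw dict are hypothetical names introduced for illustration, and the regex is copied from the question's code.

```python
import re

# Same pattern as in the question: "javascript" with optional whitespace
# or hex character references between the letters.
JS_REGEX = re.compile(r'[\s]*(&#x.{1,7})?'.join(list('javascript')))


def filter_attrs(attrs, allowed):
    """Return a new dict holding only allowed attributes, with any
    'javascript' text removed from their values (hypothetical helper)."""
    return {name: JS_REGEX.sub('', value)
            for name, value in attrs.items()
            if name in allowed}


# Made-up attributes, standing in for tag.attrs on an <a> tag.
raw = {'href': 'javascript:alert(1)', 'onclick': 'steal()', 'title': 'Home'}
cleaned = filter_attrs(raw, ['href', 'title'])
print(cleaned)
```

Because the result is a dict, assigning it back to tag.attrs keeps items() available when the tree is rendered.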