Cannot prettify scraped HTML in BeautifulSoup

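I have a small script that uses urllib2 to fetch a site's contents, finds all the link tags, adds a small piece of HTML at the top and bottom, and then tries to prettify the result. It keeps returning TypeError: sequence item 1: expected string, Tag found. I have looked around and cannot find the problem. As always, any help is much appreciated.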

import urllib2 
from BeautifulSoup import BeautifulSoup 
import re 

reddit = 'http://www.reddit.com' 
pre = '<html><head><title>Page title</title></head>' 
post = '</html>' 
site = urllib2.urlopen(reddit) 
html=site.read() 
soup = BeautifulSoup(html) 
tags = soup.findAll('a') 
tags.insert(0,pre) 
tags.append(post) 
soup1 = BeautifulSoup(''.join(tags)) 
print soup1.prettify()

Here is the traceback:

Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found

2010-01-07 Kevin

Yes, here is the traceback: Traceback (most recent call last): File "C:\Python26\bea.py", line 21, in <module> soup1 = BeautifulSoup(''.join(tags)) TypeError: sequence item 1: expected string, Tag found – Kevin 2010-01-07 17:03:52

This works for me:

soup1 = BeautifulSoup(''.join(str(t) for t in tags))
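
The error comes from the join call itself: join only accepts strings, and after the insert at position 0 the item at index 1 is a Tag object rather than a string. Below is a minimal sketch of the same failure and fix. It is written for Python 3 and uses a hypothetical FakeTag class in place of BeautifulSoup's Tag, so it runs without the library; the real Tag class is assumed to render as markup in the same way when passed to str().

```python
class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup Tag (assumption, not the real class)."""

    def __init__(self, href, text):
        self.href = href
        self.text = text

    def __str__(self):
        # Render as markup when converted to a string
        return '<a href="%s">%s</a>' % (self.href, self.text)


pre = '<html><head><title>Page title</title></head>'
post = '</html>'

# Mixed list: plain strings at both ends, a tag object in between
tags = [FakeTag('http://www.reddit.com/r/pics/', 'pics')]
tags.insert(0, pre)
tags.append(post)

try:
    ''.join(tags)  # fails: item 1 is a FakeTag, not a string
    join_error = None
except TypeError as exc:
    join_error = exc

# Converting every item to a string first makes the join succeed
joined = ''.join(str(t) for t in tags)
print(joined)
```

The plain strings pass through str() unchanged, so there is no need to treat pre and post separately.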

This pyparsing solution also gives some decent output:

from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine 

# makeHTMLTags defines HTML tag patterns for given tag string 
aTag,aEnd = makeHTMLTags("A") 

# makeHTMLTags by default returns a structure containing 
# the tag's attributes - we just want the original input text 
aTag = originalTextFor(aTag) 
aEnd = originalTextFor(aEnd) 

# define an expression for a full link, and use a parse action to 
# combine the returned tokens into a single string 
aLink = aTag + SkipTo(aEnd) + aEnd 
aLink.setParseAction(lambda tokens : ''.join(tokens)) 

# extract links from the input html 
links = aLink.searchString(html) 

# build list of strings for output 
out = [] 
out.append(pre) 
out.extend([' '+lnk[0] for lnk in links]) 
out.append(post) 

print '\n'.join(out)

Prints:

<html><head><title>Page title</title></head> 
    <a href="http://www.reddit.com/r/pics/" >pics</a> 
    <a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a> 
    <a href="http://www.reddit.com/r/politics/" >politics</a> 
    <a href="http://www.reddit.com/r/funny/" >funny</a> 
    <a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a> 
    <a href="http://www.reddit.com/r/WTF/" >WTF</a> 
    . 
    . 
    . 
    <a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a> 
    <a href="#" onclick="return hidecover(this)">close this window</a> 
    <a href="http://www.reddit.com/feedback" >volunteer to translate</a> 
    <a href="#" onclick="return hidecover(this)">close this window</a> 
</html>
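
If neither pyparsing nor BeautifulSoup is available, a similar listing can probably be produced with the standard library alone. The following is only a sketch in Python 3, and it swaps in html.parser for pyparsing. LinkCollector, wrap_links and the sample input are made-up names and data for illustration, and text inside a link is kept while any nested markup is dropped.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect each <a>...</a> element as a rebuilt markup string."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished '<a ...>text</a>' strings
        self._open = None    # raw start tag of the link being read, if any
        self._text = []      # text pieces seen inside the current link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # get_starttag_text() returns the start tag exactly as written
            self._open = self.get_starttag_text()
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._open is not None:
            self.links.append(self._open + ''.join(self._text) + '</a>')
            self._open = None


def wrap_links(html, pre, post):
    """Return pre, then one indented link per line, then post."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return '\n'.join([pre] + ['    ' + link for link in collector.links] + [post])


# Made-up sample input (assumption), not the real reddit page
sample = '<div><a href="/r/pics/">pics</a> | <a href="/r/funny/">funny</a></div>'
page = wrap_links(sample,
                  '<html><head><title>Page title</title></head>',
                  '</html>')
print(page)
```

Because every collected link is already a plain string, the final join needs no conversion step.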

2010-01-07 23:55:27 PaulMcG

soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))

2010-01-07 16:59:04

I used your line, and it now gives me a TypeError in BeautifulSoup.py: TypeError: expected string or buffer – Kevin 2010-01-07 17:07:16

There is a small syntax error in Jonathan's answer; here is the corrected version:

soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags]))

2011-06-08 15:32:06

Rather than posting this as an answer, a comment under Jonathan's answer might be more appropriate. – 2011-06-08 16:48:36
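
On the two join spellings that appear in this thread: as far as Python itself is concerned, join takes any iterable of strings, so the generator-expression form and the list-comprehension form should both be accepted and build the same string. Below is a small sketch in Python 3 (the thread uses Python 2, where unicode plays the role that str plays here). The Tag class is a hypothetical stand-in, not BeautifulSoup's.

```python
class Tag:
    """Hypothetical stand-in for a parsed tag (assumption, not BeautifulSoup's class)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


tags = ['<p>', Tag('<a href="#">close this window</a>'), '</p>']

# Generator expression passed directly as the only argument to join
from_generator = ''.join(str(tag) for tag in tags)

# List comprehension that builds the full list before joining
from_list = ''.join([str(tag) for tag in tags])

print(from_generator == from_list)
```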
