2010-01-07 25 views
2

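Cannot prettify scraped HTML in BeautifulSoup

I have a small script that uses urllib2 to fetch a website's contents, finds all the link tags, adds a small piece of HTML at the top and bottom, and then tries to prettify the result. It keeps failing with TypeError: sequence item 1: expected string, Tag found. I have looked around and cannot find the problem. As always, any help is much appreciated.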

import urllib2 
from BeautifulSoup import BeautifulSoup 
import re 

reddit = 'http://www.reddit.com' 
pre = '<html><head><title>Page title</title></head>' 
post = '</html>' 
site = urllib2.urlopen(reddit) 
html=site.read() 
soup = BeautifulSoup(html) 
tags = soup.findAll('a') 
tags.insert(0,pre) 
tags.append(post) 
soup1 = BeautifulSoup(''.join(tags)) 
print soup1.prettify() 

Here is the traceback:

Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found
+0

Yeah, here's the traceback: Traceback (most recent call last): File "C:\Python26\bea.py", line 21, in <module> soup1 = BeautifulSoup(''.join(tags)) TypeError: sequence item 1: expected string, Tag found – Kevin 2010-01-07 17:03:52

Answers

2

This works for me:

soup1 = BeautifulSoup(''.join(str(t) for t in tags)) 
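
The reason this helps: join only accepts strings, while findAll returns Tag objects, so the list in the question mixes the two strings pre and post with tags. Below is a minimal sketch of the failure and the fix. It is written for Python 3 (the question uses Python 2), and FakeTag is a hypothetical stand-in for BeautifulSoup's Tag, used only so the sketch runs without third-party packages.

```python
class FakeTag(object):
    """Hypothetical stand-in for BeautifulSoup's Tag (not the real class)."""

    def __init__(self, href, text):
        self.href = href
        self.text = text

    def __str__(self):
        # Like a real Tag, converting to str yields the markup.
        return '<a href="%s">%s</a>' % (self.href, self.text)


pre = '<html><head><title>Page title</title></head>'
post = '</html>'

# Same shape as the list in the question: str, tag, ..., tag, str
tags = [FakeTag('/r/pics/', 'pics'), FakeTag('/r/funny/', 'funny')]
tags.insert(0, pre)
tags.append(post)

try:
    ''.join(tags)
    join_error = None
except TypeError as exc:
    # Item 0 (pre) is a str; item 1 is the first non-string in the list.
    join_error = str(exc)

# Converting every item to a string first lets the join succeed.
joined = ''.join(str(t) for t in tags)

print(join_error)
print(joined)
```

Item 0 is the pre string, which is why the error message points at item 1, the first tag.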

This pyparsing solution also gives some decent output:

from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine 

# makeHTMLTags defines HTML tag patterns for given tag string 
aTag,aEnd = makeHTMLTags("A") 

# makeHTMLTags by default returns a structure containing 
# the tag's attributes - we just want the original input text 
aTag = originalTextFor(aTag) 
aEnd = originalTextFor(aEnd) 

# define an expression for a full link, and use a parse action to 
# combine the returned tokens into a single string 
aLink = aTag + SkipTo(aEnd) + aEnd 
aLink.setParseAction(lambda tokens : ''.join(tokens)) 

# extract links from the input html 
links = aLink.searchString(html) 

# build list of strings for output 
out = [] 
out.append(pre) 
out.extend([' '+lnk[0] for lnk in links]) 
out.append(post) 

print '\n'.join(out) 

Prints:

<html><head><title>Page title</title></head> 
    <a href="http://www.reddit.com/r/pics/" >pics</a> 
    <a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a> 
    <a href="http://www.reddit.com/r/politics/" >politics</a> 
    <a href="http://www.reddit.com/r/funny/" >funny</a> 
    <a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a> 
    <a href="http://www.reddit.com/r/WTF/" >WTF</a> 
    . 
    . 
    . 
    <a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a> 
    <a href="#" onclick="return hidecover(this)">close this window</a> 
    <a href="http://www.reddit.com/feedback" >volunteer to translate</a> 
    <a href="#" onclick="return hidecover(this)">close this window</a> 
</html> 
0
soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags)) 
+0

I added your line, and now it gives me a TypeError in BeautifulSoup.py: TypeError: expected string or buffer – Kevin 2010-01-07 17:07:16

0

There was a small syntax error in Jonathan's answer; here is the corrected version:

soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags])) 
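
For what it is worth, join accepts any iterable of strings, so a list comprehension and a bare generator expression should give the same result as its argument. Below is a small Python 3 check, where str plays the role of Python 2's unicode. Link is a hypothetical stand-in for a BeautifulSoup tag, not part of the library.

```python
class Link(object):
    """Hypothetical stand-in for a BeautifulSoup tag (illustration only)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


tags = [Link('<a href="#">one</a>'), Link('<a href="#">two</a>')]

# List comprehension: builds the list of strings, then joins it.
from_list = ''.join([str(tag) for tag in tags])

# Generator expression: no brackets, same joined result.
from_generator = ''.join(str(tag) for tag in tags)

print(from_list == from_generator)
```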
+2

Rather than posting this as an answer, perhaps a comment under Jonathan's answer would have been more appropriate. – 2011-06-08 16:48:36