2016-12-26 36 views

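'unicode' object has no attribute 'prettify'

I am using BeautifulSoup to parse HTML articles. I run the HTML through a few cleanup steps so that only the main article remains.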

I also want to save the soup output to a file. The error I get is:

soup = soup.prettify("utf-8") 
AttributeError: 'unicode' object has no attribute 'prettify' 

Source code:

#!/usr/bin/env python 
import urllib2 
from bs4 import BeautifulSoup 
import nltk 
import argparse 

def cleaner(): 
    url = "https://www.ceid.upatras.gr/en/announcements/job-offers/full-stack-web-developer-papergo" 
    ourUrl = urllib2.urlopen(url).read() 
    soup = BeautifulSoup(ourUrl) 

    #remove scripts 
    for script in soup.find_all('script'): 
        script.extract()
    soup = soup.find("div", class_="clearfix") 

    #below code will delete tags except /br 
    soup = soup.encode('utf-8') 
    soup = soup.replace('<br/>' , '^') 
    soup = BeautifulSoup(soup) 
    soup = (soup.get_text()) 
    soup=soup.replace('^' , '<br/>') 

    print soup 
    with open('out.txt','w',encoding='utf-8-sig') as f: 
        f.write(soup.prettify())

if __name__ == '__main__': 
    cleaner() 

Answer


This happens because, after the following lines, soup is no longer a BeautifulSoup Tag instance:

soup = (soup.get_text()) 
soup = soup.replace('^' , '<br/>') 

It has become a unicode string, and a string of course has no .prettify() method.
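The type change can be checked directly. This is only a minimal sketch: the one-line HTML snippet is made up for illustration, and "html.parser" is an assumed parser choice (the question does not name one).

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not the page from the question.
soup = BeautifulSoup(u"<div>Hello<br/>world</div>", "html.parser")

# get_text() returns a plain string, not a BeautifulSoup object.
text = soup.get_text()

print(type(soup).__name__)        # a BeautifulSoup object
print(hasattr(soup, "prettify"))  # the parsed tree has .prettify()
print(hasattr(text, "prettify"))  # the extracted string does not
```

Any name that is reassigned to the result of get_text() therefore loses all BeautifulSoup methods from that point on.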

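Depending on the output you want, you should be able to clean up your HTML with the BeautifulSoup methods .get_text(), .replace_with(), .unwrap() and .extract(), instead of trying to handle it as a plain string.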
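One possible way to do this, sketched under a few assumptions: the sample HTML below is a made-up stand-in for the downloaded page, clean_article is a hypothetical helper name, "html.parser" is an assumed parser choice, and the goal is taken to be "keep the text and the br tags, drop every other tag". io.open is used because the built-in open() in Python 2 does not accept an encoding argument.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the page fetched in the question.
SAMPLE_HTML = u"""
<html><body>
<script>var x = 1;</script>
<div class="clearfix"><p>First line<br/>Second <b>bold</b> line</p></div>
</body></html>
"""


def clean_article(html):
    """Return the article <div> with every tag except <br> unwrapped."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove scripts entirely, including their contents.
    for script in soup.find_all("script"):
        script.extract()

    article = soup.find("div", class_="clearfix")

    # unwrap() removes a tag but keeps its children in place,
    # so the text and the <br> tags survive.
    for tag in article.find_all(True):
        if tag.name != "br":
            tag.unwrap()

    return article


article = clean_article(SAMPLE_HTML)

# article is still a Tag, so .prettify() is available here.
with io.open("out.txt", "w", encoding="utf-8") as f:
    f.write(article.prettify())
```

Because the tree is never converted to a string before the final write, there is no need for the '^' placeholder trick, and .prettify() is called on an object that actually provides it.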
