Download only the text from a web page's content in Python


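How can I download only the text/html/javascript from a web page in Python?

I want to compute some statistics about the text written by blog authors. I only need the text, and I want to speed up my program by not downloading images and similar resources.

I can separate the text from the HTML markup myself. So my intention is mainly to avoid downloading the additional content of the web page (such as images, SWF files, etc.).

So far I use: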

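import urllib2

# download_html is a hypothetical wrapper name; the original snippet was a function body.
def download_html(url):
    user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req, timeout=60)
    # Default to '' so a missing Content-Type header does not raise a TypeError.
    content_type = response.info().getheader('Content-Type', '')
    if 'text/html' in content_type:
        return response.read()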

But I don't know whether I am doing the right thing (i.e., downloading only the text).

Source

2015-06-20 El Marce

I would suggest looking at the [requests](http://docs.python-requests.org/en/latest/) library for easier handling of HTTP requests. – Ben
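As a rough illustration of that suggestion, here is a minimal sketch using requests, assuming the library is installed. The function names `is_html` and `fetch_html` are made up for this example and are not part of requests. With `stream=True`, requests returns once the headers have arrived, so the Content-Type can be checked before any of the body is downloaded.

```python
import requests


def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML document."""
    # The header may carry parameters, e.g. "text/html; charset=utf-8".
    return content_type.split(';')[0].strip().lower() == 'text/html'


def fetch_html(url, timeout=60):
    """Return the page's HTML as text, or None if the response is not HTML.

    Hypothetical helper for illustration only.
    """
    # stream=True defers the body download until .text or .content is accessed.
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        if not is_html(response.headers.get('Content-Type', '')):
            return None  # the body is never read for non-HTML responses
        return response.text
```

Either way, a single GET like this fetches only the document at that URL. Images and other resources referenced by the page are separate requests that only a browser (or code that follows those links) would make.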

Python's BeautifulSoup is one of the best tools for parsing web pages:

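import urllib.request
import bs4

link = 'http://example.com/'  # placeholder URL (an assumption); use the page you want to parse
# Pass the raw bytes to BeautifulSoup so it can detect the encoding itself;
# calling str() on bytes would yield the "b'...'" repr rather than decoded text.
webpage = urllib.request.urlopen(link).read()
soup = bs4.BeautifulSoup(webpage, 'html.parser')

print(soup.get_text())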

Source

2015-06-20 08:09:32 mmachine

I want to do this for performance reasons (I will update my question). So I don't know whether your answer suits me. Still, it is useful, so +1 –
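For the performance concern, a standard-library-only sketch may be enough. It is a Python 3 port of the idea in the question, not the asker's actual code, and the names `TextExtractor`, `html_to_text` and `download_text` are invented for this example. `urlopen` issues one request for the given URL and does not fetch images or other linked resources, so the only download is the HTML itself.

```python
from html.parser import HTMLParser
import urllib.request


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


def download_text(url, timeout=60):
    """Fetch url and return its text, or None if it is not an HTML page."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Check the Content-Type header before reading the body.
        if response.headers.get_content_type() != 'text/html':
            return None
        charset = response.headers.get_content_charset() or 'utf-8'
        return html_to_text(response.read().decode(charset, errors='replace'))
```

Calling `download_text(url)` on a blog post URL would then return plain text ready for word counts or similar statistics. BeautifulSoup's `get_text()` is more forgiving of malformed markup, so this is a lighter alternative rather than a replacement.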
