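Here is a proof-of-concept that gets your idea working, just to show you that BeautifulSoup4 is really powerful and is enough for your first steps in scraping.
Also, you need to read CNN's terms of service to check whether scraping is allowed. You can find an explanation of every detail of the code below in the BS4 documentation, or you can start your career on Stack Overflow and learn every detail from the community, like I did :) Good luck and enjoy it!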
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import re

def main():
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    url = 'http://www.cnn.com/2013/10/29/us/florida-shooting-cell-phone-blocks-bullet/index.html?hpt=ju_c2'
    soup = BeautifulSoup(opener.open(url))
    #1) Link to the website
    #2) Date article published
    date = soup.find("div", {"class":"cnn_strytmstmp"}).text.encode('utf-8')
    #3) title of article
    title = soup.find("div", {"id":"cnnContentContainer"}).find('h1').text.encode('utf-8')
    #4) Text of the article
    paragraphs = soup.find('div', {"class":"cnn_strycntntlft"}).find_all('p')
    text = " ".join([ paragraph.text.encode('utf-8') for paragraph in paragraphs])
    print url
    print date
    print title
    print text

if __name__ == '__main__':
    main()
The output looks like this:
http://www.cnn.com/2013/10/29/us/florida-shooting-cell-phone-blocks-bullet/index.html?hpt=ju_c2
updated 7:34 AM EDT, Tue October 29, 2013
Cell phone stops bullet aimed at Florida gas station clerk
(CNN) -- A gas station clerk's smartphone may... the TV station reported.
Meanwhile, here is a bit of my philosophy on how we should locate elements: link here. You may also run into Selenium and Scrapy later on.
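To try the element-locating calls without hitting CNN, here is a minimal sketch that runs the same `find` / `find_all` lookups against an inline HTML string. The markup in `SAMPLE_HTML` and the helper name `extract_article` are made up for illustration; only the class and id names are taken from the code above, and the real page may be structured differently. The built-in `'html.parser'` is passed explicitly, which is an assumption on my side and not something the code above does.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the structure the code above expects.
SAMPLE_HTML = """
<div id="cnnContentContainer">
  <h1>Sample headline</h1>
  <div class="cnn_strytmstmp">updated 7:34 AM EDT, Tue October 29, 2013</div>
  <div class="cnn_strycntntlft">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

def extract_article(html):
    """Return the date, title and body text found in the given HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    # Locate a single element by tag name plus class attribute.
    date = soup.find('div', {'class': 'cnn_strytmstmp'}).get_text(strip=True)
    # Locate the container by id, then the first <h1> inside it.
    title = soup.find('div', {'id': 'cnnContentContainer'}).find('h1').get_text(strip=True)
    # Collect every <p> inside the body container and join their text.
    body = soup.find('div', {'class': 'cnn_strycntntlft'})
    text = ' '.join(p.get_text(strip=True) for p in body.find_all('p'))
    return {'date': date, 'title': title, 'text': text}

article = extract_article(SAMPLE_HTML)
print(article['title'])
print(article['text'])
```

Keep in mind that `find` returns `None` when nothing matches, so on a page with a different layout the chained calls would raise an `AttributeError`.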
B.Mr.W., thank you for your answer. What is 'utf-8'? – intelligentlywrong
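@intelligentlywrong 'utf-8' tells Python to encode the text as UTF-8, the encoding browsers commonly use to interpret text: http://en.wikipedia.org/wiki/UTF-8. Without it, sth.text returns a unicode string instead. –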
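A small sketch of what that `.encode('utf-8')` call does, independent of BeautifulSoup. The sample string is made up; the point is only that a unicode string and its UTF-8 encoded form are different objects with different lengths.

```python
# A unicode string, similar in kind to what sth.text returns.
text = u"caf\u00e9"

# Encoding turns the unicode string into a UTF-8 byte string.
encoded = text.encode('utf-8')

# Decoding with the same codec gives the original unicode string back.
round_trip = encoded.decode('utf-8')

print(len(text))     # 4 characters
print(len(encoded))  # 5 bytes, because the accented letter takes two bytes in UTF-8
```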
I tried to run your code, but I get the error "No module named 'urllib2'". I have Python 2.7 using Anaconda. –
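One possible cause, and this is only an assumption since the comment gives no further detail, is that the script is actually being run by a Python 3 interpreter, where `urllib2` no longer exists and its functionality lives in `urllib.request`. A hedged sketch of an import that works on both versions; the alias `request_lib` and the helper `build_opener_with_agent` are hypothetical names introduced here:

```python
try:
    import urllib2 as request_lib          # Python 2
except ImportError:
    import urllib.request as request_lib   # Python 3

def build_opener_with_agent(user_agent='Mozilla/5.0'):
    """Build an opener that sends the given User-agent header."""
    opener = request_lib.build_opener()
    opener.addheaders = [('User-agent', user_agent)]
    return opener

opener = build_opener_with_agent()
```

Note that the `print` statements in the answer would also need parentheses under Python 3.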