Extracting and formatting website data with Python

This is in Python 3.5.x. What I'm looking for is a way to find the heading; the HTML code is:

<h3 class = "title-link__title"><span class="title=link__text">News Here</span> 

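import urllib.request

# download the page; r.read() returns bytes
with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()
    HTML = list(HTML)  # list of integer byte values
    for i in range(len(HTML)):
        HTML[i] = chr(HTML[i])  # turn each byte value into a one-character string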

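After that piece of code, how can I get it so that I extract just the heading, since that is all I need? In any case, I will do my best to help with details.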

Source

2016-07-07 Byron Filer

Have you tried using regular expressions? Also, you may want to state explicitly what you want the program to extract from the HTML above.

Thanks, but I already have it working with BeautifulSoup, and I'm looking for headlines that change frequently.

Extracting information from web pages is called web scraping.

One of the best tools for this job is the BeautifulSoup library.

from bs4 import BeautifulSoup
import urllib.request

# open the page and read the raw HTML (bytes)
with urllib.request.urlopen('http://www.bbc.co.uk/news') as response:
    r = response.read()
# create the soup; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(r, 'html.parser')

# useful for understanding the layout of the page
# print(soup.prettify())

# build a ResultSet of all h3 tags whose class is 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})

# count the occurrences
len(a)
# result = 44 (when this answer was written)

# get the text of the first header
a[0].text
# result = '\nMay v Leadsom to be next UK PM\n'

# get the text of the second header
a[1].text
# result = '\nVideo shows US police shooting aftermath\n'
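The .text values above still carry the surrounding newlines. Since the question also asks about formatting, here is a minimal sketch of one way to tidy them. The helper name clean_headlines is hypothetical (it is not part of BeautifulSoup), and dropping duplicates is an assumption about what you want, in case the same headline appears more than once on the page. It works on plain strings, so it does not depend on how the headlines were fetched.

```python
def clean_headlines(raw_texts):
    """Collapse whitespace in each headline and drop empty or duplicate entries."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        # split() with no arguments strips both ends and splits on any run of whitespace
        headline = " ".join(text.split())
        if headline and headline not in seen:
            seen.add(headline)
            cleaned.append(headline)
    return cleaned


# sample values copied from the results shown above, with one repeated on purpose
raw = [
    '\nMay v Leadsom to be next UK PM\n',
    '\nVideo shows US police shooting aftermath\n',
    '\nMay v Leadsom to be next UK PM\n',
]

headlines = clean_headlines(raw)
for number, headline in enumerate(headlines, start=1):
    print("{}. {}".format(number, headline))
```

With the ResultSet from the answer you would call it as clean_headlines(tag.text for tag in a).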

Source

2016-07-07 20:33:12
