2016-07-07
1

Extracting and formatting website data with Python

I am using Python 3.5.x. I am trying to find the headline; the relevant HTML is:

<h3 class = "title-link__title"><span class="title=link__text">News Here</span> 

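import urllib.request

with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()    # raw page as bytes
    HTML = list(HTML)  # list of integer byte values
    # convert each byte value to a one-character string
    for i in range(len(HTML)):
        HTML[i] = chr(HTML[i])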

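After this piece of code, how can I get at it? I only want to extract the headline, because that is all I need. I will do my best to help with any further details.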
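A note on the loop above: calling chr() on each byte only round-trips ASCII text, so any multi-byte character in the page comes out mangled. Below is a minimal sketch of the difference, assuming the page is UTF-8 encoded (an assumption; the actual encoding is declared by the server). The byte string `raw` is a made-up stand-in for what `r.read()` returns.

```python
# Hypothetical stand-in for the bytes returned by r.read()
raw = "<h3>Café news</h3>".encode("utf-8")

# Per-byte chr(), as in the loop above: each byte becomes its own
# character, so the two bytes of "é" turn into two wrong characters
per_byte = "".join(chr(b) for b in raw)

# Decoding the whole byte string at once keeps multi-byte characters intact
decoded = raw.decode("utf-8")

print(per_byte)
print(decoded)
```

Decoding once also yields an ordinary string rather than a list of characters, which is usually easier to search or hand to a parser.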

+0

Have you tried using regular expressions? Also, you may want to state explicitly what you want the program to extract from the HTML above.

+0

Thanks, but I have got it working with BeautifulSoup; I am looking for headlines that change frequently.

Answers

1

Extracting information from web pages is called web scraping.

One of the best tools for this job is the BeautifulSoup library.

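from bs4 import BeautifulSoup
import urllib.request

# open the page and read its contents (in Python 3, urlopen lives in urllib.request)
r = urllib.request.urlopen('http://www.bbc.co.uk/news').read()
# create the soup, naming the parser explicitly
soup = BeautifulSoup(r, 'html.parser')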

# useful for understanding the layout of the page
# print(soup.prettify())

# create a ResultSet of all h3 tags that have the class 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})

# count occurrences
len(a)
# result = 44

# get the text of the first header
a[0].text
# result = '\nMay v Leadsom to be next UK PM\n'

# get the text of the second header
a[1].text
# result = '\nVideo shows US police shooting aftermath\n'
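Since the headlines change frequently, it may be more convenient to collect all of them at once instead of indexing one at a time. The following is a rough sketch rather than a definitive implementation: `SAMPLE_HTML` and `extract_headlines` are hypothetical names, and the sample markup is only modelled on the snippet in the question so that the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the snippet in the question
SAMPLE_HTML = """
<h3 class="title-link__title"><span class="title-link__text">First headline</span></h3>
<h3 class="title-link__title"><span class="title-link__text">Second headline</span></h3>
<h3 class="other">Not a headline</h3>
"""


def extract_headlines(html):
    """Return the stripped text of every h3 with class 'title-link__title'."""
    soup = BeautifulSoup(html, "html.parser")
    headers = soup.find_all("h3", class_="title-link__title")
    # get_text(strip=True) drops the surrounding newlines seen in a[0].text
    return [h.get_text(strip=True) for h in headers]


headlines = extract_headlines(SAMPLE_HTML)
for number, headline in enumerate(headlines, start=1):
    print(number, headline)
```

To use it on the live page, pass the `r` read from `urlopen` instead of `SAMPLE_HTML`. The class name is whatever the site uses at the time, so it may need updating if the page layout changes.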