Web scraping in Python

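I am completely new to web scraping, but I would really like to learn it in Python. I have a basic understanding of Python.

I am having trouble understanding some code that scrapes a web page, because I cannot find good documentation for the modules it uses.

The code scrapes some movie data from a web page (the IMDb search URL that appears in the code).

I get stuck after the comment "selection in pattern follows the rules of CSS".

I would like to understand the logic behind that code, or find good documentation for those modules. Is there any topic I need to learn first?

The code is as follows: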

import requests 
from pattern import web 
from BeautifulSoup import BeautifulSoup 

url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012' 
r = requests.get(url) 
print r.url 

url = 'http://www.imdb.com/search/title' 
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012') 
r = requests.get(url, params=params) 
print r.url # notice it constructs the full url for you 

#selection in pattern follows the rules of CSS 

dom = web.Element(r.text) 
for movie in dom.by_tag('td.title'):  
    title = movie.by_tag('a')[0].content 
    genres = movie.by_tag('span.genre')[0].by_tag('a') 
    genres = [g.content for g in genres] 
    runtime = movie.by_tag('span.runtime')[0].content 
    rating = movie.by_tag('span.value')[0].content 
    print title, genres, runtime, rating


2014-01-12 CreamStat

See the documentation for BeautifulSoup, which is an HTML and XML parser.

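The comment "selection in pattern follows the rules of CSS"

means that strings such as 'td.title' and 'span.runtime' are CSS selectors that help you find the data you are looking for. For example, td.title matches every <td> element whose class attribute contains "title", i.e. class="title".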
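To make the idea of a 'tag.class' selector concrete, here is a minimal Python 3 sketch that uses only the standard library's html.parser. It is not how pattern or BeautifulSoup work internally; the class TagClassFinder, the helper select_text and the sample HTML are all hypothetical names and data made up for illustration, and the sketch only handles the simple 'tag.class' form.

```python
from html.parser import HTMLParser


class TagClassFinder(HTMLParser):
    """Collect the text of elements matching a simple 'tag.class' selector."""

    def __init__(self, selector):
        super().__init__()
        # Split 'td.title' into the tag name 'td' and the class name 'title'.
        self.tag, self.css_class = selector.split(".")
        self.depth = 0      # > 0 while we are inside a matching element
        self.matches = []   # accumulated text of each matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested elements of the same tag so we know when the match ends.
            if tag == self.tag:
                self.depth += 1
            return
        # class="title wide" means the element has two classes: title and wide.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def select_text(html, selector):
    """Return the stripped text of every element matching 'tag.class'."""
    finder = TagClassFinder(selector)
    finder.feed(html)
    finder.close()
    return [text.strip() for text in finder.matches]


# Hand-written sample markup (an assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr>
    <td class="number">1.</td>
    <td class="title"><a href="/title/1">Example Movie</a></td>
  </tr>
  <tr>
    <td class="title wide"><a href="/title/2">Another Movie</a></td>
    <td>no class here</td>
  </tr>
</table>
"""

titles = select_text(SAMPLE_HTML, "td.title")
print(titles)  # ['Example Movie', 'Another Movie']
```

Note that the second matching cell has class="title wide": a class selector matches when the class is one of the element's classes, not only when it is the whole attribute value.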

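The code iterates over the HTML elements in the body of the web page and uses CSS selectors to extract each movie's title, genres, runtime, and rating.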
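The same loop can be sketched in Python 3 against a small, hand-written snippet. This is only an illustration under several assumptions: the markup below is invented and well-formed (real pages usually are not, which is why a tolerant HTML parser is normally used), the function name extract_movies is hypothetical, and the standard library's xml.etree.ElementTree with its limited XPath support stands in for pattern's by_tag, with ".//td[@class='title']" playing the role of the 'td.title' selector.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed sample; NOT the real markup of any site.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="title">
      <a href="/title/1">Example Movie</a>
      <span class="genre"><a href="/genre/drama">Drama</a> | <a href="/genre/crime">Crime</a></span>
      <span class="runtime">142 mins.</span>
      <span class="value">9.2</span>
    </td>
  </tr>
  <tr>
    <td class="title">
      <a href="/title/2">Another Movie</a>
      <span class="genre"><a href="/genre/comedy">Comedy</a></span>
      <span class="runtime">98 mins.</span>
      <span class="value">7.5</span>
    </td>
  </tr>
</table>
"""


def extract_movies(html):
    """Return (title, genres, runtime, rating) for each td with class 'title'."""
    root = ET.fromstring(html.strip())
    movies = []
    # Outer loop: one iteration per movie cell, like dom.by_tag('td.title').
    for cell in root.iterfind(".//td[@class='title']"):
        # The first <a> that is a direct child of the cell holds the title.
        title = cell.find("a").text
        # Every <a> inside the genre span is one genre.
        genres = [a.text for a in cell.iterfind("span[@class='genre']/a")]
        runtime = cell.find("span[@class='runtime']").text
        rating = cell.find("span[@class='value']").text
        movies.append((title, genres, runtime, rating))
    return movies


movies = extract_movies(SAMPLE_HTML)
for title, genres, runtime, rating in movies:
    print(title, genres, runtime, rating)
```

Each step mirrors the original loop: select the movie cells first, then run narrower selections inside each cell, so a value is always read relative to the movie it belongs to.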


2014-01-12 04:17:00 haferje
