Scraping an HTML page

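I want to get the title, year, rating, genres, and runtime of five movies from the HTML page given in the code. These are located in the rows of the table with the class results.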

from bs4 import BeautifulSoup 
import urllib2 

def read_from_url(url, num_m=5): 
    html_string = urllib2.urlopen(url) 
    soup = BeautifulSoup(html_string) 
    movie_table = soup.find('table', 'results') # table of movie 
    list_movies = [] 
    count = 0 
    for row in movie_table.find_all("tr"):
        dict_each_movie = {}
        title = title.encode("ascii", "ignore")  # getting title
        dict_each_movie["title"] = title
        year = year.encode("ascii", "ignore")  # getting year
        dict_each_movie["year"] = year
        rank = rank.encode("ascii", "ignore")  # getting rank
        dict_each_movie["rank"] = rank
        # genres = []  # getting genres of a movie
        runtime = runtime.encode("ascii", "ignore")  # getting runtime
        dict_each_movie["runtime"] = runtime
        list_movies.append(dict_each_movie)
        count += 1
        if count == num_m:
            break
    return list_movies 

print read_from_url('http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2015',2)

Expected output:

[{'rating': '10.0', 'genres': ['Comedy', 'Family'], 'title': 'How to Beat a Bully', 'rank': '1', 'year': '2014', 'runtime': '90'},..........]

Source

2015-02-05 Alph

What is title in title = title.encode("ascii", "ignore")? – 2015-02-05 14:50:12

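You are accessing variables that have not been defined yet. When the interpreter sees title.encode("ascii", "ignore"), it looks up the variable title, which has not been assigned anywhere before that point. Python has no way of knowing what title is, so you cannot call encode on it. The same applies to year and rank. Use something like this instead: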

title = 'How to Beat a Bully'.encode('ascii','ignore')
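
As a minimal sketch of the error described above: because title is assigned inside the function, Python treats it as a local name, and reading it on the right-hand side before it has a value raises UnboundLocalError (a subclass of NameError). The function names encode_title_broken and encode_title_fixed are hypothetical and only serve the illustration.

```python
def encode_title_broken():
    # `title` is read before it has ever been assigned in this scope,
    # which is exactly what happens inside read_from_url in the question.
    title = title.encode("ascii", "ignore")
    return title


def encode_title_fixed(raw_title):
    # The value to encode has to come from somewhere: here it is passed in.
    title = raw_title.encode("ascii", "ignore")
    return title


try:
    encode_title_broken()
    error_name = None
except NameError as exc:  # UnboundLocalError is a subclass of NameError
    error_name = type(exc).__name__

fixed = encode_title_fixed(u"How to Beat a Bully")
print(error_name)
print(fixed)
```

In the scraper, the value passed in would be the text extracted from each table row rather than a hard-coded string.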

Source

2015-02-05 14:58:42 runDOSrun

@runDOSrun, how can I do this automatically? Should I hard-code the title, year, runtime, and so on? – Alph 2015-02-05 15:07:29

My answer solves your error. To achieve your goal, you will have to read up on the basics of scraping; see http://www.jayrambhia.com/blog/fetch-movie-details-from-imdb-using-python-with-proxy/ – runDOSrun 2015-02-05 15:12:05

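@runDOSrun. I read the Beautiful Soup documentation and the link you provided in your comment. The example there uses soup.find to get the title. In my case, after inspecting the element for the first movie, The Forsaken Pages, the title appears inside the results table (once for each movie). Is there a way to retrieve this title (The Forsaken Pages)? – Alph 2015-02-06 03:55:15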

Why do it this way?

Use CSS selectors to make your life easier.

<table>
<tr class="my_class">
    <td id="id_here">
        <a href="link_here">First Link</a>
    </td>
    <td id="id_here">
        <a href="link_here">Second Link</a>
    </td>
</tr>
</table>

for tr in movie_table.select("tr.my_class"):
    for td in tr.select("td#id_here"):
        print("Link " + td.select("a")[0]["href"])
        print("Text " + td.select("a")[0].text)
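
To connect the selector approach back to the question, here is a rough sketch that builds one dictionary per movie row. It runs against a small inline HTML string rather than the live page, and every class name in it (detailed, number, title, year_type, genre, runtime, rating) is an assumption made for illustration, so the selectors would need adjusting to whatever the real page uses. The second row is made-up placeholder data, and read_from_html is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names and the second row are invented for this sketch.
SAMPLE_HTML = """
<table class="results">
  <tr class="header"><th>Rank</th><th>Title</th></tr>
  <tr class="detailed">
    <td class="number">1.</td>
    <td class="title">
      <a href="/title/example-1/">How to Beat a Bully</a>
      <span class="year_type">(2014)</span>
      <span class="rating">10.0</span>
      <span class="genre"><a href="#">Comedy</a> | <a href="#">Family</a></span>
      <span class="runtime">90 mins.</span>
    </td>
  </tr>
  <tr class="detailed">
    <td class="number">2.</td>
    <td class="title">
      <a href="/title/example-2/">Placeholder Movie</a>
      <span class="year_type">(2010)</span>
      <span class="rating">9.9</span>
      <span class="genre"><a href="#">Drama</a></span>
      <span class="runtime">100 mins.</span>
    </td>
  </tr>
</table>
"""


def first_text(node, selector):
    """Return the stripped text of the first element matching selector."""
    return node.select(selector)[0].get_text(strip=True)


def read_from_html(html, num_m=5):
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # Only rows carrying the (assumed) movie class are visited, so the header is skipped.
    for row in soup.select("table.results tr.detailed")[:num_m]:
        cell = row.select("td.title")[0]
        movies.append({
            "rank": first_text(row, "td.number").rstrip("."),
            "title": first_text(cell, "a"),  # first link in the cell is the title
            "year": first_text(cell, "span.year_type").strip("()"),
            "rating": first_text(cell, "span.rating"),
            "genres": [a.get_text(strip=True) for a in cell.select("span.genre a")],
            "runtime": first_text(cell, "span.runtime").split()[0],
        })
    return movies


movies = read_from_html(SAMPLE_HTML, 1)
print(movies)
```

For the live page, the HTML string would come from the urlopen call in the question instead of SAMPLE_HTML; the extraction loop stays the same once the selectors match the real markup.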

Source

2015-02-06 14:33:02 Umair
