Filtering information from a link in Python?


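I'm writing a program in Python to pull movie ratings from one of my favorite websites.

Example link: http://timesofindia.indiatimes.com/entertainment/movie-reviews/hindi/Madras-Cafe-movie-review/movie-review/21975443.cms

Currently, I use string.partition to pick out the part of the HTML source that contains the rating information. However, this approach is extremely slow.

What would be the fastest way to get the movie's rating?

Here is the code I'm using: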

#POST Request to TOI site, for review source 
data_output = requests.post(review_link) 

#Clean HTML code 
soup = BeautifulSoup(data_output.text) 

#Filter source data, via a dirty string partition method 

#rating 
texted = str(soup).partition(" stars,") 
texted = texted[0].partition("Rating: ") 
rating = texted[2] 
#title 
texted = texted[0].partition(" movie review") 
texted = texted[0].partition("<title>") 
title = texted[2] 

#print stuff 
print "Title:", title 
print "Rating:", rating, "/ 5"

Thanks!

Source

2013-08-26 Bear Hugger

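Using an actual HTML parser would help a lot; something like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). –

Posting an example of your code would also be helpful – ScottJShea

I tried BeautifulSoup, but it takes longer, because no real HTML tag holds the rating. Instead, I had to use the search_all method, which is just as time-consuming. –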

Here is an example that uses requests to fetch the HTML, lxml to parse it and locate the rating value, and re to extract the actual rating as a number:

import re 
from lxml import etree 
import requests 

URL = "http://timesofindia.indiatimes.com/entertainment/movie-reviews/hindi/Madras-Cafe-movie-review/movie-review/21975443.cms" 

response = requests.get(URL) 

parser = etree.HTMLParser() 
root = etree.fromstring(response.text, parser=parser) 
rating_text = root.find('.//div[@id="sshow"]/table/tr/td[2]/div[1]/script[1]').text # rating_text is: fbcriticRating="4"; 
print(re.search(r"\d+", rating_text).group(0)) # prints 4

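Note that you are not required to use requests here; you can use urllib2 instead, this is just an example. The main part is parsing the HTML and getting the rating value.

Hope that helps.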
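As a rough illustration of that note, here is a minimal sketch that swaps requests for the standard library. It is written for Python 3, where `urllib.request` takes the place of `urllib2`, and it skips HTML parsing entirely by searching the raw text with a regular expression. The pattern assumes the page still contains text of the form `fbcriticRating="4";` as noted in the code above; `fetch_html`, `extract_rating`, and the sample string are hypothetical names and data used only for illustration.

```python
import re
from urllib.request import urlopen  # Python 3 successor to urllib2

# Assumed pattern, based on the fbcriticRating="4"; text noted in the answer.
RATING_RE = re.compile(r'fbcriticRating\s*=\s*"(\d+)"')


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_rating(html):
    """Return the rating as an int, or None if the marker is absent."""
    match = RATING_RE.search(html)
    return int(match.group(1)) if match else None


# Hypothetical snippet standing in for the real page source.
sample = '<script>var fbcriticRating="4";</script>'
print(extract_rating(sample))  # prints 4
```

Keeping the download and the extraction in separate functions means the extraction can be checked against a saved copy of the page without hitting the site each time. If the site changes its markup, only the pattern should need updating.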

Source

2013-08-26 18:09:38 alecxe

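Great solution... but when you're doing something as simple as this, which the stdlib 'urllib2' and 'xml.etree.ElementTree' can handle, why require the extra dependencies? (Of course there are cases where 'requests' and/or 'lxml.etree' would be better, but this doesn't seem to be one of them.) – abarnert

@abarnert good point, thanks. This is just an example, and we are all humans, so we should use the library "for humans" :) – alecxe

@abarnert well, the rating value is very well "hidden" in the HTML, so getting it via xpath looks very logical and convenient.. – alecxe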
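For comparison, the stdlib-only route raised in these comments could look roughly like the sketch below. It uses Python 3's `html.parser` rather than `xml.etree.ElementTree`, on the assumption that real-world pages are often not well-formed XML. It only looks inside `<script>` elements, which is where the answer's xpath found the rating. The class name, function name, pattern, and sample markup are all hypothetical, and the real page layout is not verified here.

```python
import re
from html.parser import HTMLParser

# Assumed pattern, based on the fbcriticRating="4"; text noted in the answer.
RATING_RE = re.compile(r'fbcriticRating\s*=\s*"(\d+)"')


class ScriptTextCollector(HTMLParser):
    """Collect the text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Keep only text that appears between <script> and </script>.
        if self._in_script:
            self.scripts.append(data)


def rating_from_scripts(html):
    """Return the first rating found in any script body, else None."""
    collector = ScriptTextCollector()
    collector.feed(html)
    collector.close()
    for text in collector.scripts:
        match = RATING_RE.search(text)
        if match:
            return int(match.group(1))
    return None


# Hypothetical markup: matching text outside a script tag is ignored.
sample = '<p>fbcriticRating="9"</p><div id="sshow"><script>fbcriticRating="4";</script></div>'
print(rating_from_scripts(sample))  # prints 4
```

This is less precise than the xpath in the answer, since it accepts the first matching script anywhere on the page, but it avoids third-party dependencies.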
