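Detecting whether a web page has changed
In my Python application I have to read many web pages to collect data. To reduce the number of HTTP calls, I would like to fetch only the pages that have changed. My problem is that my code always tells me the page has changed (status code 200), even when it actually has not.
Here is my code: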
from models import mytab
import re
import urllib2
from wsgiref.handlers import format_date_time
from datetime import datetime
from time import mktime

def url_change():
    urls = mytab.objects.all()
    # Some of the URLs:
    # http://www.venere.com/it/pensioni/venezia/pensione-palazzo-guardi/#reviews
    # http://www.zoover.it/italia/sardegna/cala-gonone/san-francisco/hotel
    # http://www.orbitz.com/hotel/Italy/Venice/Palazzo_Guardi.h161844/#reviews
    # http://it.hotels.com/ho292636/casa-del-miele-susegana-italia/
    # http://www.expedia.it/Venezia-Hotel-Palazzo-Guardi.h1040663.Hotel-Information#reviews
    # ...
    for url in urls:
        request = urllib2.Request(url.url)
        if url.last_date is None:
            # First visit: store the current time as an HTTP date string
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        # Ask the server to reply 304 if nothing changed since last_date
        request.add_header("If-Modified-Since", url.last_date)
        try:
            response = urllib2.urlopen(request)  # Make the request
            # some actions
            # A 200 response arrived: record the time of this fetch
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        except urllib2.HTTPError, err:
            # urllib2 reports 304 Not Modified as an HTTPError
            if err.code == 304:
                print "nothing...."
            else:
                print "Error code:", err.code
I don't understand what is going wrong. Can anyone help me?
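For illustration, here is a minimal sketch of a conditional GET that sends back the validators the server itself supplied (the `ETag` and `Last-Modified` response headers) instead of a locally generated timestamp. It is written for Python 3's `urllib.request` rather than the `urllib2` used above, and the names `conditional_headers` and `fetch_if_changed` are hypothetical helpers, not part of any library. It can only help on servers that honour conditional requests; a dynamically generated page may well answer 200 every time regardless.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build revalidation headers from validators saved on a previous fetch."""
    headers = {}
    if etag:
        # Echo the server's ETag back verbatim, quotes included
        headers["If-None-Match"] = etag
    if last_modified:
        # Echo the server's own Last-Modified string, not the local clock
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (changed, body, etag, last_modified) for one URL.

    Hypothetical helper: the caller is assumed to store the returned
    validators and pass them back in on the next call.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            return (True, body,
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the validators we already had
            return (False, None, etag, last_modified)
        raise
```

If the first response carries neither an `ETag` nor a `Last-Modified` header, the server is giving the client nothing to revalidate against, and this approach will not reduce the number of full downloads for that site.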
Have you considered that web pages might lie about their dates? – 2013-03-04 17:25:46
@宇宙公主 No, I had not considered that. So what can be done to check whether a page has changed? I also tried hashing, but the page changes on every load. – RoverDar 2013-03-04 17:35:32
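On the hashing point, a rough sketch of one way to make a hash less sensitive to per-load noise: strip scripts, styles, comments and markup before hashing, so that only the visible text contributes. The function name `content_fingerprint` is hypothetical, and the regex-based stripping is a deliberate simplification (a real HTML parser would be more robust). It assumes the parts that change on every load live in scripts, comments or attributes; if the visible text itself varies, for example a rendered timestamp, the hash will still differ.

```python
import hashlib
import re

# Blocks whose contents often change on every load (tokens, ads, timestamps)
_NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
                    re.DOTALL | re.IGNORECASE)
_TAGS = re.compile(r"<[^>]+>")
_SPACE = re.compile(r"\s+")


def content_fingerprint(html):
    """Hash only the visible text of an HTML document (crude sketch)."""
    text = _NOISE.sub(" ", html)   # drop scripts, styles and comments
    text = _TAGS.sub(" ", text)    # drop remaining tags and their attributes
    text = _SPACE.sub(" ", text).strip()  # collapse whitespace differences
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = "<html><script>var t=123;</script><p>Great  hotel</p></html>"
    second = "<html><script>var t=456;</script><p>Great hotel</p></html>"
    # The two loads differ only in script content and spacing
    print(content_fingerprint(first) == content_fingerprint(second))
```

Storing this fingerprint per URL and comparing it after each download would not save the HTTP call itself, but it would at least show whether the content of interest has changed before any further processing is done.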