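How to check whether a value on a website has changed

Basically, I'm trying to run some code (Python 3.2) if a value on a website changes; otherwise, wait a while and check again later.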

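First, I thought I could just save the value in a variable and compare it with the new value fetched the next time the script ran. That quickly ran into a problem: the value is overwritten when the script runs again and re-initializes the variable.

So I tried saving the page's HTML to a file and comparing it with the HTML fetched on the next run. That didn't work either, because the comparison kept returning False even when nothing had changed.

Next I tried pickling the page and comparing the pickle with the HTML. Interestingly, that doesn't work inside the script either. However, if I type file = pickle.load(open('D:\\Download\\htmlString.p', 'rb')) after the script has run and then file == html, it shows True when nothing has changed.

I'm a bit confused as to why it doesn't work while the script runs, yet shows the correct answer when I do the above.

Edit: Thanks for the responses so far, guys. My question isn't really about other ways to approach this (although it's always good to learn more ways to accomplish a task!), but rather why the code below doesn't work when run as a script, whereas if I reload the pickle object at the prompt after the script has run and then test it against html, it returns True when nothing has changed.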

try:
    file = pickle.load(open('D:\\Download\\htmlString.p', 'rb'))
    if pickle.load(open('D:\\Download\\htmlString.p', 'rb')) == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump(htmlString, open('D:\\Download\\htmlString.p', "wb"))
        print('Saving')
except:
    pickle.dump(htmlString, open('D:\\Download\\htmlString.p', "wb"))
    print('ERROR')

2012-06-28 Jason White

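What is the content/mimetype of the remote and local content? – DeaconDesperado

Saving and comparing the whole page is going to be very inefficient. You could compute a hash such as MD5 and save that instead. If the hash matches in the future, the page hasn't changed. – TJD

I've updated my answer to address your edit. Is that what you were looking for? – Phil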

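Edit: I hadn't realized you were only looking for the problem with your script. Here's what I think the problem is, followed by my original answer, which covers another approach to the bigger problem you're trying to solve.

Your script is a great example of the dangers of using a blanket except statement: you catch everything. In this case, that includes your sys.exit(0).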
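To make that concrete, here is a minimal sketch (Python 3, with made-up function names) of the mechanism: sys.exit() works by raising SystemExit, which a bare except catches, while except Exception lets it through because SystemExit derives from BaseException rather than Exception.

```python
import sys

def exit_inside_bare_except():
    """Mimics the script in the question: sys.exit() inside try, bare except."""
    try:
        sys.exit(0)          # raises SystemExit
    except:                  # a bare except catches SystemExit too
        return "swallowed"   # so the script keeps running instead of exiting

def exit_inside_narrow_except():
    """Same code, but only ordinary errors are caught."""
    try:
        sys.exit(0)
    except Exception:        # SystemExit is not a subclass of Exception
        return "swallowed"

print(exit_inside_bare_except())      # the exit never happens

try:
    exit_inside_narrow_except()
except SystemExit as exc:
    print("SystemExit propagated with code", exc.code)
```

In the question's script this means that, on an unchanged page, "Values haven't changed!" is printed, the exit is swallowed, and the except branch then rewrites the pickle and prints 'ERROR'.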

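I'm assuming your try block is there to handle the case where D:\Download\htmlString.p doesn't exist yet. That error is called IOError, and you can catch it specifically with except IOError.

Here is your script, plus a bit of code before it to make it run, with the except problem fixed: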

import sys 
import pickle 
import urllib2 

request = urllib2.Request('http://www.iana.org/domains/example/') 
response = urllib2.urlopen(request) # Make the request 
htmlString = response.read() 

try:
    file = pickle.load(open('D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump(htmlString, open('D:\\Download\\htmlString.p', "wb"))
        print('Saving')
except IOError:
    pickle.dump(htmlString, open('D:\\Download\\htmlString.p', "wb"))
    print('Created new file.')

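As a side note, you might consider using os.path for your file paths. It will help anyone who later wants to use your script on another platform, and it saves you the ugly double backslashes.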
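As a rough illustration of that suggestion, the path could be built with os.path.join. The folder location below (a "Download" folder under the user's home directory) is an assumption made for the example, not the path from the question.

```python
import os.path

# ASSUMPTION: keep the pickle in a "Download" folder under the home directory.
download_dir = os.path.join(os.path.expanduser("~"), "Download")
pickle_path = os.path.join(download_dir, "htmlString.p")

# os.path.join inserts the right separator for the current platform
print(pickle_path)
```

The same pickle_path variable can then be passed to every open() call, so the path is written only once.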

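Edit 2: Adapted for your specific URL.

The ad on that page contains a dynamically generated number that changes with every page load. It sits near the end, after all the content, so we can split the HTML string at that point and keep the first half, discarding the part with the dynamic number.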

import sys 
import pickle 
import urllib2 

request = urllib2.Request('http://ecal.forexpros.com/e_cal.php?duration=weekly') 
response = urllib2.urlopen(request) # Make the request 
# Grab everything before the dynamic double-click link
htmlString = response.read().split('<iframe src="http://fls.doubleclick')[0]

try:
    # Pickle files are opened in binary mode, as in the script above
    file = pickle.load(open('D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump(htmlString, open('D:\\Download\\htmlString.p', "wb"))
        print('Saving')
except IOError:
    pickle.dump(htmlString, open('D:\\Download\\htmlString.p', "wb"))
    print('Created new file.')

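Your string is no longer a valid HTML document, if that matters. If it does, you could perhaps just remove that one line or something similar. There is probably a more elegant way to do this, maybe deleting the number with a regex, but this at least answers your question.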
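A rough sketch of the regex idea, in Python 3. The exact format of the changing number is not shown in this thread, so the pattern below assumes it appears as "ord=<digits>" inside the doubleclick iframe URL; the sample strings are made up, and the real page may well need a different pattern.

```python
import re

# ASSUMPTION: the changing number appears as "ord=<digits>" inside the
# doubleclick iframe URL. The real page may use a different format.
DYNAMIC_NUMBER = re.compile(r'(<iframe src="http://fls\.doubleclick[^"]*?ord=)\d+')

def strip_dynamic_number(html):
    """Return html with the per-load number removed, keeping the rest intact."""
    return DYNAMIC_NUMBER.sub(r'\1', html)

# Two made-up page loads that differ only in the ad number
load_1 = '<p>rate: 1.25</p><iframe src="http://fls.doubleclick.net/x;ord=111?"></iframe>'
load_2 = '<p>rate: 1.25</p><iframe src="http://fls.doubleclick.net/x;ord=987654?"></iframe>'

print(strip_dynamic_number(load_1) == strip_dynamic_number(load_2))
```

Unlike the split approach, this keeps the rest of the document, including the closing tags, so the stored string stays a complete page.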

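Original answer - an alternative approach to the problem.

What do the web server's response headers look like? HTTP specifies a Last-Modified property that you can use to check whether the content has changed (assuming the server tells the truth). Use it together with a HEAD request, as Uku shows in his answer, if you want to save bandwidth and be nice to the server you're polling.

There is also an If-Modified-Since header, which sounds like what you might be looking for.

If we combine them, you might come up with something like this: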

import sys 
import os.path 
import urllib2 

url = 'http://www.iana.org/domains/example/' 
saved_time_file = 'last time check.txt' 

request = urllib2.Request(url) 
if os.path.exists(saved_time_file): 
    """ If we've previously stored a time, get it and add it to the request""" 
    last_time = open(saved_time_file, 'r').read() 
    request.add_header("If-Modified-Since", last_time) 

try: 
    response = urllib2.urlopen(request) # Make the request 
except urllib2.HTTPError as err:
    if err.code == 304:
        print "Nothing new."
        sys.exit(0)
    raise # some other http error (like 404 not found etc); re-raise it.

last_modified = response.info().get('Last-Modified', False) 
if last_modified: 
    open(saved_time_file, 'w').write(last_modified) 
else: 
    print("Server did not provide a last-modified property. Continuing...") 
    """ 
    Alternately, you could save the current time in HTTP-date format here: 
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3 
    This might work for some servers that don't provide Last-Modified, but do 
    respect If-Modified-Since. 
    """ 

""" 
You should get here if the server won't confirm the content is old. 
Hopefully, that means it's new. 
HTML should be in response.read(). 
"""
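Since the question mentions Python 3.2, where urllib2 no longer exists, here is a sketch of the same idea using urllib.request. The function names are invented for this example, and http_date shows one way to produce the HTTP-date mentioned in the comment above, using email.utils.formatdate from the standard library.

```python
import urllib.error
import urllib.request
from email.utils import formatdate

def build_conditional_request(url, last_modified=None):
    """Build a GET request that carries If-Modified-Since when a time is known."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request

def fetch_if_changed(url, last_modified=None):
    """Return (body, Last-Modified header), or (None, last_modified) on a 304."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:          # Not Modified: nothing new
            return None, last_modified
        raise                        # any other HTTP error is re-raised

def http_date(timestamp=None):
    """Format a Unix timestamp (default: now) as an HTTP-date in GMT."""
    return formatdate(timestamp, usegmt=True)
```

A caller would store the second element of the returned tuple between runs and pass it back in as last_modified on the next check. As in the version above, this only helps if the server honors If-Modified-Since.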

Also, check out this blog post by STII, which may shed some light. I don't know enough about ETags to put them into my example, but his code checks for them too.

2012-06-28 23:01:56 Phil

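I also missed the edit while writing this answer... answer #2 coming up. – Phil

Hey Phil, thanks for pointing that out about sys.exit; I didn't know it raises an exception to exit the script. My original problem isn't solved, though. For some unknown reason it still won't report True even when it should, unless I reload the pickle object and then test for equality. Thanks, though! –

Hmm, that's odd. It seems to work fine for me: on the first run it says "Created new file.", and after that it correctly says "Values haven't changed!" or "Saving". I tested it against a server I control. What URL are you using? Is it your own or someone else's? Maybe it's something platform-specific. I'm running Linux here. – Phil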

It would be more efficient to do a HEAD request and check the Content-Length of the document.

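import os.path
import urllib2

length_file = 'old_length.txt'  # hypothetical file holding the previous length

# Read the old length from the file, if a previous run saved one
old_length = None
if os.path.exists(length_file):
    old_length = open(length_file).read().strip()

request = urllib2.Request('http://www.yahoo.com')
request.get_method = lambda: 'HEAD'  # ask for the headers only

response = urllib2.urlopen(request)
new_length = response.info()["Content-Length"]
if old_length != new_length:
    print "something has changed"
    open(length_file, 'w').write(new_length)  # remember it for the next run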

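Note that it is unlikely, though possible, for the content to change while the Content-Length stays exactly the same; on the other hand, this is the most efficient way to check. Whether the method is suitable depends on the kind of changes you expect.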

2012-06-28 20:59:16

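Nifty. The question title seems to imply he's checking a specific value on the page, though, so if it's just an integer or something like that, the chance of the content length not changing is much higher. – Phil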
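A tiny illustration of that caveat, using made-up page bodies in Python 3: when a single digit changes, the length stays the same, so a Content-Length comparison reports no change, whereas a hash of the body does differ.

```python
import hashlib

# Two made-up page bodies where only a single digit differs
old_page = b"<span id='rate'>1.2540</span>"
new_page = b"<span id='rate'>1.2541</span>"

same_length = len(old_page) == len(new_page)
same_hash = hashlib.md5(old_page).hexdigest() == hashlib.md5(new_page).hexdigest()

print(same_length)  # True: a Content-Length check would miss this change
print(same_hash)    # False: a hash of the body detects it
```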

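By hashing the contents of both, you can always tell whether there is any difference between a locally stored file and the remote data. This is commonly used to verify the integrity of downloaded data. For continuous checking you will need a while loop.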

import hashlib
import time
import urllib

num_checks = 20
last_check = 1
while last_check != num_checks:
    remote_data = urllib.urlopen('http://remoteurl').read()
    remote_hash = hashlib.md5(remote_data).hexdigest()

    local_data = open('localfilepath', 'rb').read()
    local_hash = hashlib.md5(local_data).hexdigest()
    if remote_hash == local_hash:
        print 'right now, we match!'
    else:
        print 'right now, we are different'

    last_check += 1  # count this check so the loop terminates
    time.sleep(60)   # wait before checking again (the interval is arbitrary)

If the actual data never needs to be kept locally, I would only ever store the MD5 hash and compute it on the fly when checking.
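A minimal sketch of that hash-only variant in Python 3. The helper name has_changed and the file location are assumptions made for this example; data stands for the downloaded page body as bytes.

```python
import hashlib
import os
import tempfile

def has_changed(data, hash_file):
    """Compare the MD5 of data (bytes) with the stored digest, then store the new one.

    Returns True on the first run, when no digest has been stored yet.
    """
    new_hash = hashlib.md5(data).hexdigest()
    old_hash = None
    if os.path.exists(hash_file):
        with open(hash_file) as f:
            old_hash = f.read().strip()
    with open(hash_file, "w") as f:
        f.write(new_hash)
    return new_hash != old_hash

# Usage with made-up data; in practice data would be the downloaded page body
hash_file = os.path.join(tempfile.mkdtemp(), "page.md5")  # hypothetical location
print(has_changed(b"value: 1", hash_file))  # first run, nothing stored yet
print(has_changed(b"value: 1", hash_file))  # same content as before
print(has_changed(b"value: 2", hash_file))  # content differs
```

Only the 32-character digest is written to disk, so the page itself never has to be saved.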

2012-06-28 21:01:44 DeaconDesperado

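I'm not entirely clear on whether you just want to see if the website has changed, or whether you are going to do more with the site's data. If the former, definitely use a hash, as mentioned earlier. Here is a working example (Python 2.6.1 on a Mac) that compares the complete old HTML with the new HTML; it should be easy to modify so that it uses a hash or only a specific part of the website. Hopefully the comments and docstrings make everything clear.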

import urllib2

def getFilename(url):
    '''
    Input: url
    Return: a (string) filename to be used later for storing the url's contents
    '''
    name = str(url)
    # Remove the scheme as a prefix; lstrip('http://') would strip a set of
    # characters and could also eat the start of the host name
    if name.startswith('http://'):
        name = name[len('http://'):]
    return name.replace("/", ":") + '.OLD'


def getOld(url):
    '''
    Input: url- a string containing a url
    Return: a string containing the old html, or None if there is no old file
    (checks if there already is a url.OLD file, and makes an empty one if there
    isn't, to handle the case that this is the first run)
    Note: the file created with the old html is in the format url(with : for /).OLD
    '''
    oldFilename = getFilename(url)
    oldHTML = ""
    try:
        oldHTMLfile = open(oldFilename, 'r')
    except IOError:
        # file doesn't exist! so make it
        with open(oldFilename, 'w') as oldHTMLfile:
            oldHTMLfile.write("")
        return None
    else:
        oldHTML = oldHTMLfile.read()
        oldHTMLfile.close()

    return oldHTML

class ConnectionError(Exception):
    def __init__(self, value):
        if type(value) != type(''):
            self.value = str(value)
        else:
            self.value = value
    def __str__(self):
        return 'ConnectionError: ' + self.value


def htmlHasChanged(url):
    '''
    Input: url- a string containing a url
    Return: a boolean stating whether the website at url has changed
    '''

    try:
        fileRecvd = urllib2.urlopen(url).read()
    except Exception:
        # Exception (not a bare except) leaves SystemExit and Ctrl-C alone
        print 'Could not connect to %s, sorry!' % url
        # handle bad connection error...
        raise ConnectionError("urlopen() failed to open " + str(url))
    else:
        oldHTML = getOld(url)
        if oldHTML == fileRecvd:
            hasChanged = False
        else:
            hasChanged = True

        # rewrite file
        with open(getFilename(url), 'w') as f:
            f.write(fileRecvd)

        return hasChanged

if __name__ == '__main__':
    # test it out with whatismyip.com
    try:
        print htmlHasChanged("http://automation.whatismyip.com/n09230945.asp")
    except ConnectionError as e:
        print e

2012-06-28 22:03:28

Oops, I didn't see the edit to the original question before posting... –
