Using httplib to check whether a URL returns a certain page?

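I am going through a few hundred bit.ly links to see whether they have already been used to shorten a URL. If a link has not been used, it returns this page.

How can I iterate over the list of links to check which ones do not return this page?

I tried the HEAD method used in this question, but of course that always returns true.

I looked at the HEAD method, but found that it never returns any data: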

>>> import httplib 
>>> conn = httplib.HTTPConnection("www.python.org") 
>>> conn.request("HEAD","/index.html") 
>>> res = conn.getresponse() 
>>> print res.status, res.reason 
200 OK 
>>> data = res.read() 
>>> print len(data) 
0 
>>> data == '' 
True

I'm stumped on this; any help would be great.

Source

2014-03-02 Scherf

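Do you want to get the content of the page? –

I'd like to be able to check the links without loading the page content, but if that is the only way, then that's fine – Scherf

Check res.status (for example, 301 is a redirect) – jfs

If bit.ly returns a 404 HTTP status code for links that have not been shortened: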

#!/usr/bin/env python
from httplib import HTTPConnection
from urlparse import urlsplit

urls = ["http://bit.ly/NKEIV8", "http://bit.ly/1niCdh9"]
for url in urls:
    host, path = urlsplit(url)[1:3]  # network location and path
    conn = HTTPConnection(host)
    conn.request("HEAD", path)  # HEAD asks for the status and headers only
    r = conn.getresponse()
    if r.status != 404:  # skip links that bit.ly reports as not found
        print("{r.status} {url}".format(**vars()))
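On Python 3, `httplib` was renamed to `http.client` and `urlsplit` moved to `urllib.parse`. Here is a minimal sketch of the same check under that assumption. The helper names `split_host_path`, `head_status`, and `looks_shortened` are made up for illustration, and treating any redirect status as "shortened" is an assumption based on the 301 comment, not documented bit.ly behavior.

```python
from http.client import HTTPConnection  # Python 3 name for httplib
from urllib.parse import urlsplit       # Python 3 home of urlsplit


def split_host_path(url):
    """Return (host, path) for a URL, using "/" when the path is empty."""
    parts = urlsplit(url)
    return parts.netloc, parts.path or "/"


def head_status(url, timeout=10):
    """Send a HEAD request and return the status code (needs network access)."""
    host, path = split_host_path(url)
    conn = HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def looks_shortened(status):
    """Assumption: a redirect status means the short link points somewhere."""
    return status in (301, 302, 303, 307, 308)
```

Something like `looks_shortened(head_status(url))` could then replace the `r.status != 404` test if you would rather key on redirects than on the absence of a 404.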

Unrelated: to speed up the checks, you can use multiple threads:

#!/usr/bin/env python
from httplib import HTTPConnection
from multiprocessing.dummy import Pool  # use threads
from urlparse import urlsplit

urls = ["http://bit.ly/NKEIV8", "http://bit.ly/1niCdh9"]  # same list as above

def getstatus(url):
    try:
        host, path = urlsplit(url)[1:3]
        conn = HTTPConnection(host)
        conn.request("HEAD", path)
        r = conn.getresponse()
    except Exception as e:
        return url, None, str(e)  # error
    else:
        return url, r.status, None

p = Pool(20)  # use 20 concurrent connections
for url, status, error in p.imap_unordered(getstatus, urls):
    if status != 404:
        print("{status} {url} {error}".format(**vars()))
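For comparison, the same fan-out can be sketched on Python 3 with `concurrent.futures.ThreadPoolExecutor`, a different standard-library thread pool from the `multiprocessing.dummy.Pool` used above. The function name `check_all`, the `get_status` parameter, and the example.com URLs with their statuses are all made up so that the sketch runs without network access. In real use `get_status` would perform the HEAD request.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, get_status, workers=20):
    """Return {url: status} for every URL in the list whose status is not 404."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as the input list
        statuses = list(executor.map(get_status, urls))
    return {url: s for url, s in zip(urls, statuses) if s != 404}


# Hypothetical stand-in for a real HEAD request; the statuses are invented.
fake_statuses = {"http://example.com/a": 301, "http://example.com/b": 404}
result = check_all(list(fake_statuses), fake_statuses.get)
print(result)
```

Passing the status function in as an argument keeps the threading logic separate from the network code, which makes it easier to try out offline.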

Source

2014-03-02 22:40:44 jfs

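You are correct, sir, thank you – Scherf

This works perfectly. The problem I had was that when I imported the links from a file, I forgot to strip the newline characters, and because of that they all returned 200. Great answer, thank you – Scherf

@Scherf: I've added a multithreaded version. – jfs

So, here is a simple way to do it: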
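The newline problem mentioned in the comments can be avoided by cleaning each line as it is read. Here is a minimal Python sketch. The helper name `read_urls` and the file name `links.txt` are hypothetical, and `io.StringIO` stands in for a real file so the example runs on its own.

```python
import io


def read_urls(lines):
    """Yield cleaned URLs from an iterable of lines, skipping blank ones."""
    for line in lines:
        url = line.strip()  # drops the trailing "\n" and stray spaces
        if url:
            yield url


# io.StringIO stands in for a real file, e.g. open("links.txt")
sample = io.StringIO("http://bit.ly/NKEIV8\n\nhttp://bit.ly/1niCdh9\n")
urls = list(read_urls(sample))
print(urls)
```

The resulting `urls` list can be passed to either of the checking loops in place of the hard-coded one.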

import httplib2 
h = httplib2.Http(".cache") 
resp, content = h.request("http://www.python.org/", "GET") 
print content

Source: https://code.google.com/p/httplib2/wiki/Examples

Source

2014-03-02 22:25:16
