Uploading images from a web page

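I want to implement functionality similar to http://www.tineye.com/parse?url=yahoo.com, which lets a user upload images from any web page.

The main problem for me is that pages with a large number of images take too much time.

I am doing this in Django (using curl or urllib) according to the following plan:

Grab the HTML of the page (takes about 1 second for large pages):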

file = urllib.urlopen(requested_url) 
html_string = file.read()

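Parse it with an HTML parser (BeautifulSoup), look for img tags, and write the src of every image to a list. (This also takes about 1 second for large pages.)
Check the size of every image in the list and, if it is large enough, return it in a JSON response. (This takes very long: about 15 seconds when the page has about 80 images.) Here is the code of the function: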


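import urllib  # Python 2 API, as used elsewhere in this question
from PIL import ImageFile  # assumed import; the original snippet omitted it

def get_image_size(uri):
    """Return (width, height) read from the first 1 KB of uri, or None."""
    file = urllib.urlopen(uri)
    try:
        p = ImageFile.Parser()
        data = file.read(1024)
        if not data:
            return None
        p.feed(data)
        if p.image:
            return p.image.size
        # not an image, or 1 KB was not enough to determine the size
        return None
    finally:
        # close the connection on every path, including the early returns
        file.close()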

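As you can see, I am not loading the full image to get its size, only the first 1 KB of it. But it still takes too much time when there are a lot of images (I call this function once for every image found).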
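To illustrate why a short prefix is often enough: PNG and GIF store their dimensions within the first few bytes of the file. The sketch below is a stdlib-only illustration in Python 3, not the PIL-based approach used in the question, and `size_from_header` is a hypothetical helper name. JPEG keeps its dimensions in a frame marker that may come after metadata segments, so 1 KB is not always enough for that format.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def size_from_header(data):
    """Return (width, height) from the leading bytes of a PNG or GIF, else None."""
    # PNG: 8-byte signature, 4-byte chunk length, b"IHDR", then width and
    # height as big-endian 32-bit unsigned integers.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then width and height as little-endian
    # 16-bit unsigned integers.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    # Unknown format, or not enough bytes to decide.
    return None
```

Anything this helper cannot identify would still need a real image parser such as `ImageFile.Parser`.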

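So how can I make this work faster?

Is there perhaps a way to avoid making a request for every single image?

Any help would be highly appreciated.

Thanks!

Source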

2011-04-09 gleb.pitsevich

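What about just checking Content-Length in the HTTP response? – tmg 2011-04-09 19:22:36

Yes, I considered that, but I want to show only images selected by width and height (for example, width or height over 100 pixels), and that is hard to do knowing only the content length. – 2011-04-09 22:00:17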

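A few optimizations I can think of:

Parse the file as you read it from the stream
Use a SAX parser (which fits well with the point above)
Use HEAD to get the size of the images
Put your images in a queue, then use several threads to connect and get the file sizes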
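The first two points could be sketched as follows. This is a minimal illustration in Python 3 that uses the standard library's event-driven `html.parser.HTMLParser` in place of a true SAX parser or BeautifulSoup; `ImgCollector` and `collect_images` are hypothetical names. The parser is fed chunk by chunk, so image URLs become available before the whole page has been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collect absolute image URLs from <img src=...> tags as HTML is fed in."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.sources.append(urljoin(self.base_url, src))


def collect_images(chunks, base_url):
    """Feed text chunks (already decoded to str) into the parser one at a time."""
    parser = ImgCollector(base_url)
    for chunk in chunks:
        # Incomplete tags are buffered until the next chunk arrives.
        parser.feed(chunk)
    parser.close()
    return parser.sources
```

In real use the chunks would come from repeated `read()` calls on the HTTP response, decoded with the page's character set.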

Example of a HEAD request:

$ telnet m.onet.pl 80 
Trying 213.180.150.45... 
Connected to m.onet.pl. 
Escape character is '^]'. 
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1 
host: m.onet.pl 

HTTP/1.0 200 OK 
Server: nginx/0.8.53 
Date: Sat, 09 Apr 2011 18:32:44 GMT 
Content-Type: image/jpeg 
Content-Length: 37545 
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT 
Expires: Sat, 16 Apr 2011 18:32:44 GMT 
Cache-Control: max-age=604800 
Accept-Ranges: bytes 
Age: 6575 
X-Cache: HIT from emka1.m10r2.onet 
Via: 1.1 emka1.m10r2.onet:80 (squid) 
Connection: close 

Connection closed by foreign host.
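The same request can be issued from Python. A minimal sketch, assuming Python 3's `urllib.request` (the question itself uses Python 2's `urllib`); `build_head_request` and `content_length` are hypothetical helper names. Not every server sends Content-Length, so the function returns None when the header is missing.

```python
import urllib.request


def build_head_request(url):
    """Create a request that asks for headers only, without the body."""
    return urllib.request.Request(url, method="HEAD")


def content_length(url, timeout=5):
    """Return the Content-Length reported for url in bytes, or None if absent."""
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None
```

This gives the file size in bytes only, not the pixel dimensions, so it might serve as a cheap pre-filter before any image data is fetched.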

Source

2011-04-09 20:23:12 Jerzyk

Thanks for reminding me about threads! Everything now works at an acceptable speed (roughly a 10x speedup for about 30 requests). Marked as accepted! – 2011-04-11 18:57:50
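The threaded approach credited in the comment above might look like the following. This is a sketch in Python 3 that uses `concurrent.futures.ThreadPoolExecutor` (which manages the work queue internally) rather than a hand-written queue; `large_images` is a hypothetical name, and `get_size` stands for any function like the question's `get_image_size`. The 100-pixel threshold comes from the asker's earlier comment.

```python
from concurrent.futures import ThreadPoolExecutor


def large_images(urls, get_size, min_side=100, max_workers=10):
    """Return the urls whose width or height exceeds min_side pixels.

    get_size(url) must return (width, height) or None; it runs in worker
    threads, so slow network requests overlap instead of running in sequence.
    """
    urls = list(urls)

    def safe_size(url):
        # One failing download should not abort the whole batch.
        try:
            return get_size(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so sizes line up with urls.
        sizes = list(pool.map(safe_size, urls))

    return [url for url, size in zip(urls, sizes) if size and max(size) > min_side]
```

Because the work is dominated by waiting on the network, threads help here despite Python's global interpreter lock.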

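You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).

Here is a test I wrote for it. As you can see, it is fast, although I imagine some websites would block too many repeated requests.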

|milo|laurie|¥ cat test.py 
import urllib2 
uri = "http://download.thinkbroadband.com/1GB.zip" 

def get_file_size(uri): 
    file = urllib2.urlopen(uri) 
    content_header, = [header for header in file.headers.headers if header.startswith("Content-Length")] 
    _, str_length = content_header.split(':') 
    length = int(str_length.strip()) 
    return length 

if __name__ == "__main__": 
    get_file_size(uri) 
|milo|laurie|¥ time python2 test.py 
python2 test.py 0.06s user 0.01s system 35% cpu 0.196 total

Source

2011-04-09 20:21:33
