2016-09-16 86 views
1

Get the width of an image from HTML code

I can get the width attribute of an image using BeautifulSoup as follows:

img = soup.find("img") 
width = img["width"] 

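The problem is that width may be set in a CSS file, or not set at all.

I want to extract this value without downloading the image from img["src"]. How can I, in Python, extract the value when it is set somewhere (HTML or CSS), or get the default value the browser would render when it is not set?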
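
For the part of the question that the markup alone can answer, here is a minimal sketch. It assumes the width is declared on the tag itself, either as a width attribute or in an inline style, and it only understands plain pixel values. The helper name declared_width is hypothetical, not part of BeautifulSoup; it takes an attribute dict such as the one BeautifulSoup exposes as img.attrs. Widths set in external stylesheets or by JavaScript are out of its reach.

```python
import re

# Hypothetical helper, not a BeautifulSoup API.
_PX = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:px)?\s*$", re.IGNORECASE)
_STYLE_WIDTH = re.compile(r"(?:^|;)\s*width\s*:\s*([^;]+)", re.IGNORECASE)


def declared_width(attrs):
    """Return the pixel width declared on the tag itself, or None.

    attrs is a dict of attribute names to string values.
    """
    style_match = _STYLE_WIDTH.search(attrs.get("style", ""))
    if style_match:
        # An inline style declaration overrides the width attribute
        value = style_match.group(1)
    else:
        value = attrs.get("width", "")
    px = _PX.match(value)
    # Percentages and other units are not resolved here
    return float(px.group(1)) if px else None
```

With BeautifulSoup this could be called as declared_width(img.attrs). A result of None means the markup does not settle the question and one of the approaches in the answers is needed.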

+1

Answer: you can't. 'BeautifulSoup' is not a full browser. – Jan

+0

http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python –

Answers

2

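You can download only part of the image, just enough to read the width and height, by setting Range in the request headers and parsing the result with getimageinfo.py.

A usage example (a slightly modified variant):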

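import requests
import getimageinfo

def check_is_small_pic(url, pic_size):
    # Ask only for the first bytes; a Range value has the form "bytes=first-last".
    # 5000 bytes is a guess: JPEG dimensions may sit further into the file.
    r_check = requests.get(url, headers={"Range": "bytes=0-5000"})
    # getImageInfo returns (content_type, width, height); -1 means "not found"
    image_info = getimageinfo.getImageInfo(r_check.content)
    return image_info[1] < pic_size or image_info[2] < pic_size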
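
The partial download relies on the HTTP Range header, whose value takes the form bytes=first-last. A server is free to ignore it and answer with status 200 and the whole file instead of 206 and a slice. The following standard-library sketch shows the header format and that check; the function names and the 5000-byte default are assumptions made for illustration.

```python
import urllib.request


def build_range_request(url, last_byte):
    """Build a GET request asking only for bytes 0..last_byte (inclusive)."""
    return urllib.request.Request(
        url, headers={"Range": "bytes=0-{}".format(last_byte)}
    )


def fetch_head_bytes(url, last_byte=5000):
    """Return (body, was_partial).

    was_partial is False when the server ignored the Range header and
    answered with the whole resource (status 200 instead of 206).
    """
    request = build_range_request(url, last_byte)
    with urllib.request.urlopen(request) as response:
        # Cap the read so an ignored Range header does not pull the full image
        body = response.read(last_byte + 1)
        return body, response.status == 206
```

The returned bytes can then be passed to getImageInfo. If the dimensions come back as -1, requesting a larger slice is one option.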

A version of getimageinfo.py, quickly adapted for Python 3.5:

import io 
import struct 
# import urllib.request as urllib2 

def getImageInfo(data):
    """Return (content_type, width, height) read from the leading bytes of an image.

    width and height stay -1 when they cannot be determined.
    """
    size = len(data)
    height = -1
    width = -1
    content_type = ''

    # handle GIFs: 6-byte signature, then little-endian 16-bit width and height
    if (size >= 10) and data[:6] in (b'GIF87a', b'GIF89a'):
        content_type = 'image/gif'
        w, h = struct.unpack(b"<HH", data[6:10])
        width = int(w)
        height = int(h)

    # See PNG 2. Edition spec (http://www.w3.org/TR/PNG/)
    # Bytes 0-7 are below, 4-byte chunk length, then 'IHDR'
    # and finally the 4-byte width, height
    elif ((size >= 24) and data.startswith(b'\211PNG\r\n\032\n')
          and (data[12:16] == b'IHDR')):
        content_type = 'image/png'
        w, h = struct.unpack(b">LL", data[16:24])
        width = int(w)
        height = int(h)

    # Maybe this is for an older PNG version.
    elif (size >= 16) and data.startswith(b'\211PNG\r\n\032\n'):
        content_type = 'image/png'
        w, h = struct.unpack(b">LL", data[8:16])
        width = int(w)
        height = int(h)

    # handle JPEGs: walk the segments until a start-of-frame marker (0xC0-0xC3)
    elif (size >= 2) and data.startswith(b'\377\330'):
        content_type = 'image/jpeg'
        jpeg = io.BytesIO(data)
        jpeg.read(2)
        b = jpeg.read(1)
        try:
            while (b and ord(b) != 0xDA):
                while (ord(b) != 0xFF): b = jpeg.read(1)
                while (ord(b) == 0xFF): b = jpeg.read(1)
                if (ord(b) >= 0xC0 and ord(b) <= 0xC3):
                    jpeg.read(3)
                    h, w = struct.unpack(b">HH", jpeg.read(4))
                    # set the result only once a frame header was actually found
                    width = int(w)
                    height = int(h)
                    break
                else:
                    jpeg.read(int(struct.unpack(b">H", jpeg.read(2))[0]) - 2)
                b = jpeg.read(1)
        except (struct.error, ValueError, TypeError):
            # truncated or malformed data: short reads raise struct.error,
            # and ord(b'') at end of input raises TypeError
            pass

    return content_type, width, height
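
The byte layouts that the GIF and PNG branches rely on can be checked without any network access by building headers by hand. This is a self-contained sketch: fake_gif_header, fake_png_header and read_size are hypothetical names introduced here, and read_size repeats only the GIF and PNG offsets used above, not the JPEG scan.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def fake_gif_header(width, height):
    # 6-byte signature followed by little-endian 16-bit width and height
    return b"GIF89a" + struct.pack("<HH", width, height)


def fake_png_header(width, height):
    # signature, 4-byte IHDR chunk length (13), chunk type,
    # then big-endian 32-bit width and height
    return (PNG_SIGNATURE + struct.pack(">L", 13) + b"IHDR"
            + struct.pack(">LL", width, height))


def read_size(data):
    """Return (width, height) from a GIF or PNG header, or None."""
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    if (len(data) >= 24 and data.startswith(PNG_SIGNATURE)
            and data[12:16] == b"IHDR"):
        return struct.unpack(">LL", data[16:24])
    return None
```

The PNG header built this way is exactly 24 bytes long and the GIF one 10 bytes, which is why a short Range request is enough for those two formats. The same byte strings can be fed to getImageInfo as a quick sanity check.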



# from PIL import Image 
# import requests 
# hrefs = ['http://farm4.staticflickr.com/3894/15008518202_b016d7d289_m.jpg','https://farm4.staticflickr.com/3920/15008465772_383e697089_m.jpg','https://farm4.staticflickr.com/3902/14985871946_86abb8c56f_m.jpg'] 
# RANGE = 5000 
# for href in hrefs: 
#  req = requests.get(href,headers={'User-Agent':'Mozilla5.0(Google spider)','Range':'bytes=0-{}'.format(RANGE)}) 
#  im = getImageInfo(req.content) 
# 
#  print(im) 
# req = urllib2.Request("http://vn-sharing.net/forum/images/smilies/onion/ngai.gif", headers={"Range": "bytes=0-5000"}) 
# r = urllib2.urlopen(req) 
# 
# f = open("D:\\Pictures\\1.jpg", "rb") 
# print(getImageInfo(r.read())) 
# Output: >> ('image/gif', 50, 50) 
# print(getImageInfo(f.read())) 
2

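The short answer is: you can't. The resulting size of the image depends on evaluating the CSS and, in practice, the JS as well. You would need to do all of that work to find out.

Another approach might be to use a real browser to do this for you and then ask it what the width is. See PhantomJS and Selenium.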
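
A rough sketch of the real-browser route with Selenium's Python bindings follows. It assumes Selenium is installed and that Chrome with a matching driver is available; the function names rendered_width and measure are hypothetical. The width is read with the standard DOM call getBoundingClientRect through the driver's execute_script method, so it reflects whatever CSS and JS have done to the element by that point.

```python
def rendered_width(driver, css_selector):
    """Ask the browser for the rendered width, in CSS pixels, of the first
    element matching css_selector on the currently loaded page.

    Returns None when no element matches.
    """
    script = (
        "var el = document.querySelector(arguments[0]);"
        "return el ? el.getBoundingClientRect().width : null;"
    )
    return driver.execute_script(script, css_selector)


def measure(url, css_selector="img"):
    # Imported here so rendered_width can be used without Selenium installed
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are set up
    try:
        driver.get(url)
        return rendered_width(driver, css_selector)
    finally:
        driver.quit()
```

Note that the browser still fetches the image in order to lay out the page, so this answers "what width is rendered" rather than avoiding the download.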

+0

Adding an example of how to use what you recommend would be a good idea. –