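To get the width of an image, I can use BeautifulSoup to read the image's width attribute from the HTML code, like this:
img = soup.find("img")
width = img["width"]
The problem is that the width can be set in a CSS file, or not set at all.
I want to extract the value without downloading the image from img["src"]. How can I, in Python, extract the value when it is set somewhere (HTML or CSS), or get the default value the browser would render if it is not set?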
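For the part of this that static parsing can cover, here is a minimal sketch that reads the width attribute and falls back to an inline style. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; ImgWidthParser and declared_widths are hypothetical names, and a width set in an external CSS file or a browser default is not covered.

```python
import re
from html.parser import HTMLParser


class ImgWidthParser(HTMLParser):
    """Collect the declared width of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        width = attrs.get("width")  # None when the attribute is absent
        if width is None:
            # Fall back to an inline style such as style="width: 120px".
            # Anchoring on start-of-string or ';' avoids matching max-width.
            style = attrs.get("style") or ""
            match = re.search(r"(?:^|;)\s*width\s*:\s*([^;]+)", style)
            width = match.group(1).strip() if match else None
        self.widths.append(width)


def declared_widths(html):
    """Return one entry per <img>: the declared width string, or None."""
    parser = ImgWidthParser()
    parser.feed(html)
    return parser.widths
```

A None entry means nothing was declared in the markup itself, which is exactly the case where the real width has to come from CSS or from the image data.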
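You can download just part of the image, only enough to get the width/height, by setting Range in the request headers, and then use getimageinfo.py.
A slightly modified variant of the usage example: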
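import requests
import getimageinfo

def check_is_small_pic(url, pic_size):
    # Request only the first 5000 bytes; the image header carries the dimensions
    r_check = requests.get(url, headers={"Range": "bytes=0-4999"})
    image_info = getimageinfo.getImageInfo(r_check.content)
    # image_info is (content_type, width, height); unknown sizes are -1,
    # so an image that cannot be parsed also counts as small
    return image_info[1] < pic_size or image_info[2] < pic_size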
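To make the Range header itself concrete, here is a minimal sketch using only the standard library's urllib instead of requests. The helper names build_range_request and fetch_prefix and the 5000-byte default are assumptions for illustration, not part of getimageinfo.py.

```python
import urllib.request


def build_range_request(url, num_bytes=5000):
    """Build a request asking the server for only the first num_bytes bytes."""
    # Byte ranges are inclusive, so bytes=0-4999 covers 5000 bytes.
    headers = {"Range": "bytes=0-{}".format(num_bytes - 1)}
    return urllib.request.Request(url, headers=headers)


def fetch_prefix(url, num_bytes=5000):
    """Return at most num_bytes bytes from the start of the resource."""
    with urllib.request.urlopen(build_range_request(url, num_bytes)) as resp:
        # Status 206 means the server honoured the range. Status 200 means it
        # ignored the header and is sending the whole body, so cap the read.
        return resp.read(num_bytes)
```

The value has to follow the bytes=first-last form. A server that does not support range requests will usually just return the full image, which is why the read is capped as well.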
Here is getimageinfo.py, quickly adapted for Python 3.5:
import io
import struct
# import urllib.request as urllib2


def getImageInfo(data):
    """Return (content_type, width, height); dimensions are -1 if unknown."""
    size = len(data)
    height = -1
    width = -1
    content_type = ''

    # handle GIFs
    if (size >= 10) and data[:6] in (b'GIF87a', b'GIF89a'):
        # Check to see if content_type is correct
        content_type = 'image/gif'
        w, h = struct.unpack(b"<HH", data[6:10])
        width = int(w)
        height = int(h)

    # See PNG 2. Edition spec (http://www.w3.org/TR/PNG/)
    # Bytes 0-7 are below, 4-byte chunk length, then 'IHDR'
    # and finally the 4-byte width, height
    elif ((size >= 24) and data.startswith(b'\211PNG\r\n\032\n')
          and (data[12:16] == b'IHDR')):
        content_type = 'image/png'
        w, h = struct.unpack(b">LL", data[16:24])
        width = int(w)
        height = int(h)

    # Maybe this is for an older PNG version.
    elif (size >= 16) and data.startswith(b'\211PNG\r\n\032\n'):
        # Check to see if we have the right content type
        content_type = 'image/png'
        w, h = struct.unpack(b">LL", data[8:16])
        width = int(w)
        height = int(h)

    # handle JPEGs
    elif (size >= 2) and data.startswith(b'\377\330'):
        content_type = 'image/jpeg'
        jpeg = io.BytesIO(data)
        jpeg.read(2)
        b = jpeg.read(1)
        try:
            # Walk the segments until a start-of-frame marker (0xC0-0xC3)
            while (b and ord(b) != 0xDA):
                while (ord(b) != 0xFF): b = jpeg.read(1)
                while (ord(b) == 0xFF): b = jpeg.read(1)
                if (ord(b) >= 0xC0 and ord(b) <= 0xC3):
                    jpeg.read(3)
                    h, w = struct.unpack(b">HH", jpeg.read(4))
                    width = int(w)
                    height = int(h)
                    break
                else:
                    jpeg.read(int(struct.unpack(b">H", jpeg.read(2))[0])-2)
                b = jpeg.read(1)
        except (struct.error, ValueError, TypeError):
            # Truncated data: on Python 3, ord(b'') raises TypeError
            pass

    return content_type, width, height
# from PIL import Image
# import requests
# hrefs = ['http://farm4.staticflickr.com/3894/15008518202_b016d7d289_m.jpg','https://farm4.staticflickr.com/3920/15008465772_383e697089_m.jpg','https://farm4.staticflickr.com/3902/14985871946_86abb8c56f_m.jpg']
# RANGE = 5000
# for href in hrefs:
#     req = requests.get(href,headers={'User-Agent':'Mozilla5.0(Google spider)','Range':'bytes=0-{}'.format(RANGE)})
#     im = getImageInfo(req.content)
#
#     print(im)
# req = urllib2.Request("http://vn-sharing.net/forum/images/smilies/onion/ngai.gif", headers={"Range": "bytes=0-5000"})
# r = urllib2.urlopen(req)
#
# f = open("D:\\Pictures\\1.jpg", "rb")
# print(getImageInfo(r.read()))
# Output: >> ('image/gif', 50, 50)
# print(getImageInfo(f.read()))
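To see why a short prefix is enough for GIF and PNG, here is a small worked example that builds the header bytes by hand and reads the dimensions back. read_dimensions is a hypothetical, stripped-down stand-in for getImageInfo (no JPEG handling), and the sizes 50x50 and 640x480 are arbitrary.

```python
import struct

# First bytes of a 50x50 GIF: 6-byte signature, then width and height
# as little-endian unsigned 16-bit integers.
gif_header = b"GIF89a" + struct.pack("<HH", 50, 50)

# First bytes of a 640x480 PNG: 8-byte signature, 4-byte chunk length
# (13 for IHDR), the chunk type, then big-endian 32-bit width and height.
png_header = (b"\x89PNG\r\n\x1a\n" + struct.pack(">L", 13) + b"IHDR"
              + struct.pack(">LL", 640, 480))


def read_dimensions(data):
    """Return (width, height) from a GIF or PNG prefix, or None if unknown."""
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    if (len(data) >= 24 and data[:8] == b"\x89PNG\r\n\x1a\n"
            and data[12:16] == b"IHDR"):
        return struct.unpack(">LL", data[16:24])
    return None


print(read_dimensions(gif_header))  # (50, 50)
print(read_dimensions(png_header))  # (640, 480)
```

GIF needs only 10 bytes and PNG only 24. JPEG is the reason for requesting a few thousand bytes, since its size sits in a start-of-frame segment that can come after metadata of variable length.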
Answer: You can't. 'BeautifulSoup' is not a full browser. – Jan
http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python