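Scraping images from a page, but the file comes back empty

I am modifying this script to scrape book page images like this. Using the script straight from Stack Overflow, it correctly returns every image except the one I want. That image is saved as an empty file named: img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png.
In my modified version below, I only pull the book page image.
Here is my script: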
from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys

out_folder = '/Users/Craig/Desktop/img'

def main(url, out_folder):
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))
    for image in soup.findAll('img', id='page_image'):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

def _usage():
    print "usage: python dumpimages.py http://example.com [outpath]"

if __name__ == "__main__":
    url = sys.argv[-1]
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)
Any ideas?
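One thing worth checking, offered as a sketch rather than a confirmed diagnosis: the script writes the whole src value into the path component of the page URL. For a relative src such as img.php?dir=...&file=1.png, that drops the page's directory, puts the image's query string inside the path, and still appends the page's own query string if it has one. The Python 3 snippet below (urllib.parse replaces the Python 2 urlparse module) resolves the src with urljoin instead and derives a cleaner local file name. The page URL and the helper names resolve_image_url and local_filename are hypothetical, and preferring a "file" query parameter for the name is an assumption based only on the URL shown in the question.

```python
import posixpath
from urllib.parse import parse_qs, urljoin, urlparse


def resolve_image_url(page_url, src):
    """Resolve an <img> src against the page URL, the way a browser would."""
    return urljoin(page_url, src)


def local_filename(image_url, default="image"):
    """Choose a local file name for a downloaded image.

    Assumption: if the URL carries a 'file' query parameter (as in the
    question's img.php link), use its value; otherwise fall back to the
    last segment of the URL path.
    """
    parts = urlparse(image_url)
    query = parse_qs(parts.query)
    if "file" in query:
        return posixpath.basename(query["file"][0]) or default
    return posixpath.basename(parts.path) or default


# Hypothetical page URL; the src is the one reported in the question.
page_url = "http://example.com/books/view.php?id=7"
src = "img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png"

image_url = resolve_image_url(page_url, src)
print(image_url)
print(local_filename(image_url))
```

If the resolved URL turns out to match what the original script already requested, the empty file has some other cause (for instance, the server might expect cookies or a Referer header), which this sketch does not address.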
Looks good. Thanks!