
Stuck on web scraping code

I have the following code. I want to go out to a website, download all of the related comics from it, and store them on my computer. The first image downloads fine, but there seems to be a problem with the loop that is supposed to move on to the previous pages of the site. If anyone could take a look at the code and help, it would be greatly appreciated. The error I get is:

Traceback (most recent call last):
  File "C:\Users\528000\Desktop\kids print\Comic-gather.py", line 41, in <module>
    prevLink = soup.select('a[class="prevLink"]')[0]
IndexError: list index out of range


import requests, os, bs4

url = 'http://darklegacycomics.com'
os.makedirs('darklegacy', exist_ok=True)
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text)

    comicElem = soup.select('.comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        try:
            comicUrl = 'http://darklegacycomics.com' + comicElem[0].get('src')
            # Download the image.
            print('Downloading image %s...' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()
        except requests.exceptions.MissingSchema:
            # Skip this comic.
            prevLink = soup.select('.prevlink')[0]
            url = 'http://darklegacycomics.com' + prevLink.get('href')
            continue
        # Save the image to ./darklegacy.
        imageFile = open(os.path.join('darklegacy', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[class="prevLink"]')[0]
    url = 'http://darklegacycomics.com' + prevLink.get('href')
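
The IndexError in the traceback means that soup.select('a[class="prevLink"]') returned an empty list, so there is nothing at index [0]. That happens when no anchor on the page matches the selector, for example if the markup uses a differently-cased class name (the except branch above already selects '.prevlink', all lowercase) or if the last page has no previous-page link at all. A minimal sketch of a guarded lookup, assuming the lowercase class name is the right one and reusing the '#' sentinel that already ends the while-loop:

# Guarded lookup of the Prev button; the a.prevlink selector is an assumption.
prevLinks = soup.select('a.prevlink')
if not prevLinks:
    # No previous-page link found: set the sentinel so the while-loop stops.
    url = '#'
else:
    url = 'http://darklegacycomics.com' + prevLinks[0].get('href')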

Answer


This will get all of the images:

import requests, os, bs4
from urllib.parse import urljoin  # this was urlparse.urljoin in Python 2

url = 'http://darklegacycomics.com'

soup = bs4.BeautifulSoup(requests.get(url).content)

# Get all img tags whose src value starts with /image.
links = soup.select('.comic img[src^="/image"]')

for img in links:
    # Extract the link.
    src = img["src"]
    # Use the image name as the file name; the response content is bytes, so open in binary mode.
    with open(os.path.basename(src), "wb") as f:
        # Join the base URL and the image URL and write the content to disk.
        f.write(requests.get(urljoin(url, src)).content)
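
The snippet above only grabs the images linked from the front page. To walk back through the archive the way the question intends, the same approach can be combined with a loop that follows the previous-page link. This is only a sketch: the a.prevlink selector and the assumption that relative hrefs resolve against the site root come from the question's code, not from checking the live site.

import requests, os, bs4
from urllib.parse import urljoin

base = 'http://darklegacycomics.com'
os.makedirs('darklegacy', exist_ok=True)

url = base
while url is not None:
    soup = bs4.BeautifulSoup(requests.get(url).content, 'html.parser')

    # Save every comic image on the current page into ./darklegacy.
    for img in soup.select('.comic img[src^="/image"]'):
        src = img['src']
        data = requests.get(urljoin(base, src)).content
        with open(os.path.join('darklegacy', os.path.basename(src)), 'wb') as f:
            f.write(data)

    # Follow the Prev link if there is one; stop when the page has none.
    prev = soup.select('a.prevlink')
    url = urljoin(base, prev[0]['href']) if prev else None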