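Random "IndexError: list index out of range"

I am trying to scrape a website that returns its data via JavaScript. The code I wrote using BeautifulSoup works fine most of the time, but at random points during scraping I get the following error: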

Traceback (most recent call last):
  File "scraper.py", line 48, in <module>
    accessible = accessible[0].contents[0]
IndexError: list index out of range

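Sometimes I can scrape 4 URLs, sometimes 15, but at some point the script eventually fails with the error above. I can't find any pattern behind the failures, so I'm really at a loss here. What am I doing wrong?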

from bs4 import BeautifulSoup 
import urllib 
import urllib2 
import jabba_webkit as jw 
import csv 
import string 
import re 
import time 

countries = csv.reader(open("countries.csv", 'rb'), delimiter=",") 
database = csv.writer(open("herdict_database.csv", 'w'), delimiter=',') 

basepage = "https://www.herdict.org/explore/" 
session_id = "indepth;jsessionid=C1D2073B637EBAE4DE36185564156382" 
ccode = "#fc=IN" 
end_date = "&fed=12/31/" 
start_date = "&fsd=01/01/" 

year_range = range(2009, 2011) 
years = [str(year) for year in year_range] 

def get_number(var):
    number = re.findall("(\d+)", var)

    if len(number) > 1:
        thing = number[0] + number[1]
    else:
        thing = number[0]

    return thing

def create_link(basepage, session_id, ccode, end_date, start_date, year): 
    link = basepage + session_id + ccode + end_date + year + start_date + year 
    return link 



for ccode, name in countries:
    for year in years:
        link = create_link(basepage, session_id, ccode, end_date, start_date, year)
        print link
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")

        accessible = soup.find_all("em", class_="accessible")
        inaccessible = soup.find_all("em", class_="inaccessible")

        accessible = accessible[0].contents[0]
        inaccessible = inaccessible[0].contents[0]

        acc_num = get_number(accessible)
        inacc_num = get_number(inaccessible)

        print acc_num
        print inacc_num
        database.writerow([name]+[year]+[acc_num]+[inacc_num])

        time.sleep(2)

2013-01-24 LukasKawerau

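You need to add error handling to your code. When scraping a lot of websites, some pages will be malformed or otherwise broken. When that happens, you end up trying to manipulate empty objects.

Look through the code, find every place where it assumes something worked, and check for errors there. To address your specific problem: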

if not inaccessible or not accessible: 
    # malformed page 
    continue
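
The advice to look for hidden assumptions also applies to the question's get_number helper: it indexes number[0], so it would raise the same IndexError if the text contained no digits at all. Below is a minimal sketch of a more defensive variant. Returning None for "no digits found" is an assumption made here for illustration, not something the original code does.

```python
import re

def get_number(text):
    """Join the first two digit groups found in text, e.g. '1,234' -> '1234'.

    Returns None when text contains no digits (an assumed convention),
    instead of raising IndexError as the original helper would.
    """
    groups = re.findall(r"\d+", text)
    if not groups:
        return None
    # Same result as the original: first group alone, or first + second.
    return "".join(groups[:2])
```

The caller would then need to skip the row (or log the URL) when get_number returns None, rather than writing it to the CSV file.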

2013-01-24 20:14:16

+1 for addressing the general concern. – Wilduck

Thank you very much! :) I will definitely add error handling to my code, thanks for the advice! – LukasKawerau

soup.find_all("em", class_="accessible") is probably returning an empty list. You can try:

if accessible: 
    accessible = accessible[0].contents[0]

Or, more generally:

if accessible and inaccessible:
    accessible = accessible[0].contents[0] 
    inaccessible = inaccessible[0].contents[0] 
else: 
    print 'Something went wrong!' 
    continue
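
The same check can be wrapped in a small helper so that both lookups share it. This is only a sketch: first_text is a hypothetical name, and it assumes each element of the list has a .contents list, as the tags returned by find_all do.

```python
def first_text(tags):
    """Return the first child of the first tag in tags, or None.

    tags is expected to be the list returned by soup.find_all(...).
    None is returned when the list is empty or the first tag has no
    children, so the caller never hits an IndexError.
    """
    if not tags:
        return None
    contents = tags[0].contents
    if not contents:
        return None
    return contents[0]
```

In the question's loop, this could be called as first_text(soup.find_all("em", class_="accessible")), with a continue whenever either result is None.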

2013-01-24 20:11:08 root

+1: for this specific case, this is what I would do. – Wilduck

Thanks a lot, this helped a great deal :) – LukasKawerau
