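I'm trying to scrape a website that returns its data via JavaScript. The code I wrote using BeautifulSoup works well, but at arbitrary points during scraping I get the following, seemingly random, error: "IndexError: list index out of range"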
Traceback (most recent call last):
  File "scraper.py", line 48, in <module>
    accessible = accessible[0].contents[0]
IndexError: list index out of range
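Sometimes I can scrape 4 URLs, sometimes 15, but at some point the script eventually fails with the error above. I can't find any pattern behind the failures, so I'm really at a loss here. What am I doing wrong?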
from bs4 import BeautifulSoup
import urllib
import urllib2
import jabba_webkit as jw
import csv
import string
import re
import time

countries = csv.reader(open("countries.csv", 'rb'), delimiter=",")
database = csv.writer(open("herdict_database.csv", 'w'), delimiter=',')

basepage = "https://www.herdict.org/explore/"
session_id = "indepth;jsessionid=C1D2073B637EBAE4DE36185564156382"
ccode = "#fc=IN"
end_date = "&fed=12/31/"
start_date = "&fsd=01/01/"

year_range = range(2009, 2011)
years = [str(year) for year in year_range]

def get_number(var):
    number = re.findall("(\d+)", var)
    if len(number) > 1:
        thing = number[0] + number[1]
    else:
        thing = number[0]
    return thing

def create_link(basepage, session_id, ccode, end_date, start_date, year):
    link = basepage + session_id + ccode + end_date + year + start_date + year
    return link

for ccode, name in countries:
    for year in years:
        link = create_link(basepage, session_id, ccode, end_date, start_date, year)
        print link
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")
        accessible = soup.find_all("em", class_="accessible")
        inaccessible = soup.find_all("em", class_="inaccessible")
        accessible = accessible[0].contents[0]
        inaccessible = inaccessible[0].contents[0]
        acc_num = get_number(accessible)
        inacc_num = get_number(inaccessible)
        print acc_num
        print inacc_num
        database.writerow([name]+[year]+[acc_num]+[inacc_num])
        time.sleep(2)
+1 for addressing the general concern. – Wilduck
Thanks a lot! :) I will definitely add error handling to my code, thanks for the suggestion! – LukasKawerau
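
For anyone reading along, here is a rough sketch of what the error handling mentioned in the comment above might look like. `find_all` returns an empty list when nothing matches, so `accessible[0]` raises the IndexError whenever the `em` elements are missing from the HTML. A plausible cause (an assumption, not something confirmed in this thread) is that the JavaScript had not finished rendering when the page was captured. The helper names `first_text` and `fetch_until_parsed` are hypothetical and not part of BeautifulSoup or jabba_webkit; the retry count and delay are arbitrary.

```python
import time


def first_text(matches):
    """Return the first child of the first match, or None if nothing matched.

    `matches` is expected to behave like the list returned by
    soup.find_all(...): a sequence of objects with a `.contents` list.
    """
    if not matches:
        return None
    contents = matches[0].contents
    return contents[0] if contents else None


def fetch_until_parsed(fetch, parse, attempts=3, delay=2.0):
    """Call fetch() and parse(html) until parse() returns something other than None.

    Returns the parsed value, or None if every attempt came back empty,
    so the caller can log the URL and move on instead of crashing.
    """
    for attempt in range(attempts):
        html = fetch()
        result = parse(html)
        if result is not None:
            return result
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

In the loop from the question, `fetch` could be something like `lambda: jw.get_page(link)`, and `parse` a small function that builds the soup, calls `first_text` on both `find_all` results, and returns the pair only when neither is None. If `fetch_until_parsed` still returns None, skipping the row and printing the link seems safer than letting the script die partway through.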