
I'm trying to learn web scraping and keep running into exceptions while the scraper runs.

I've been using exceptions to let the code keep going past the errors, since the errors don't affect the data being written to the CSV.

I keep getting a 'socket.gaierror', but when I try to handle it alongside a 'urllib.error.URLError' handler, I get "NameError: name 'socket' is not defined", which seems roundabout.

I somewhat understand that using these exceptions is probably not the best way to run the code, but I can't seem to get past these errors and I don't know a workaround or how to fix them.

Any suggestions beyond just patching the exception handling would be appreciated.
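To be concrete, a stripped-down version of the handling in question looks like the sketch below (the hostname is made up purely to force a lookup failure; note that both socket and URLError have to be imported for the names in the except clause to resolve):

import socket                       # without this import, "socket.gaierror" in the
from urllib.error import URLError   # except clause cannot be resolved
from urllib.request import urlopen

try:
    urlopen('http://no-such-host.invalid/')
except (URLError, socket.gaierror) as e:
    # urlopen wraps the failed DNS lookup in a URLError, with the original
    # socket.gaierror available as e.reason, so URLError alone would also catch it
    print('skipping:', e)

The full script is below: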

import csv
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

base_url = 'http://www.fangraphs.com/' # used below for concatenation
years = ['2017','2016','2015'] # for enough data to run tests

# Getting Links for letters
player_urls = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
    if link.has_attr('href'):
        player_urls.append(base_url + link['href'])

# Getting Alphabet Links
test_for_playerlinks = 'players.aspx?letter='
player_alpha_links = []
for i in player_urls:
    if test_for_playerlinks in i:
        player_alpha_links.append(i)

# Getting Player Links
ind_player_urls = []
for l in player_alpha_links:
    data = urlopen(l)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            ind_player_urls.append(link['href'])

# Player Links
jan = 'statss.aspx?playerid'
players = []
for j in ind_player_urls:
    if jan in j:
        players.append(j)

# Building Pitcher List
pitcher = 'position=P'
pitchers = []
pos_players = []
for i in players:
    if pitcher in i:
        pitchers.append(i)
    else:
        pos_players.append(i)

# Individual Links to Different Tables Sorted by Base URL differences
splits = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs = 'http://www.fangraphs.com/statsd.aspx?'
split_pp = []
gamel = []
years = ['2017','2016','2015']
for i in pos_players:
    for year in years:
        split_pp.append(splits + i[12:] + '&season=' + year)
        gamel.append(game_logs + i[12:] + '&type=&gds=&gde=&season=' + year)

split_pitcher = []
gl_pitcher = []
for i in pitchers:
    for year in years:
        split_pitcher.append(splits + i[12:] + '&season=' + year)
        gl_pitcher.append(game_logs + i[12:] + '&type=&gds=&gde=&season=' + year)

# Splits for Pitcher Data
row_sp = []
rows_sp = []
try:
    for i in split_pitcher:
        sauce = urlopen(i)
        soup = BeautifulSoup(sauce, "html.parser")
        table1 = soup.find_all('strong', {"style":"font-size:15pt;"})
        row_sp = []
        for name in table1:
            nam = name.get_text()
            row_sp.append(nam)
        table = soup.find_all('table', {"class":"rgMasterTable"})
        for h in table:
            he = h.find_all('tr')
            for i in he:
                td = i.find_all('td')
                for j in td:
                    row_sp.append(j.get_text())
            rows_sp.append(row_sp)
# note: neither URLError nor socket is imported above, so Python cannot resolve
# these names when this except clause is evaluated
except (RuntimeError, TypeError, NameError, URLError, socket.gaierror):
    pass

try:
    with open('SplitsPitchingData2.csv', 'w') as fp:
        writer = csv.writer(fp)
        writer.writerows(rows_sp)
except (RuntimeError, TypeError, NameError):
    pass

Could you paste the full traceback? – jlaur


Maybe a sentence or two about what your code is supposed to do? – jlaur

Answer


My guess is that your main problem is that you query the site with a huge number of invalid URLs, without any sleep in between (you build 3 URLs for 2015-2017 for a total of 22880 pitchers, but most of them weren't active in that range, so you fire off tens of thousands of queries that just return errors).

I'm surprised your IP hasn't been banned by the site admins. That said: you'd do well to add some filtering so you avoid all those faulty queries...

The filter I apply isn't perfect. It checks whether one of the years in your list appears at the start or the end of the year span listed on the site (e.g. '2004 - 2015'). This still creates some faulty links, but nowhere near as many as the original script.
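To make the check concrete, here is roughly the same test pulled out on its own, with a few made-up year spans (the full script below applies it inline):

years = ['2017', '2016', '2015']

def span_matches(span):
    # keep a player only if one of the target years opens or closes the span shown on the site
    return any(span.startswith(year) or span.endswith(year) for year in years)

print(span_matches('2004 - 2015'))  # True  - kept, although the 2016/2017 queries for this player will still fail
print(span_matches('2015 - 2017'))  # True  - kept, and all three queried seasons are plausible
print(span_matches('2001 - 2009'))  # False - skipped, which saves three pointless requests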

In code it could look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import sleep
import csv

base_url = 'http://www.fangraphs.com/'
years = ['2017','2016','2015']

# Getting Links for letters
letter_links = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
    try:
        link = base_url + link['href']
        if 'players.aspx?letter=' in link:
            letter_links.append(link)
    except:
        pass
print("[*] Retrieved {} links. Now fetching content for each...".format(len(letter_links)))


# the data resides in two different base_urls:
splits_url = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs_url = 'http://www.fangraphs.com/statsd.aspx?'

# we need (for some reason) the pitchers in two lists - pitchers_split and pitchers_game_log -
# and the rest of the players in two others - pos_players_split and pos_players_game_log
pos_players_split = []
pos_players_game_log = []
pitchers_split = []
pitchers_game_log = []

# and if we want to do something with the data from the letter queries, let's put that in a list for safekeeping:
ind_player_urls = []
current_letter_count = 0
for link in letter_links:
    current_letter_count += 1
    data = urlopen(link)
    soup = BeautifulSoup(data, "html.parser")
    trs = soup.find('div', class_='search').find_all('tr')
    for player in trs:
        player_data = [td.text for td in player.find_all('td')]
        # To prevent tons of queries to fangraphs with invalid years - check if an element from the years list starts or ends the player's listed span:
        if any(year in player_data[1] for year in years if player_data[1].startswith(year) or player_data[1].endswith(year)):
            href = player.a['href']
            player_data.append(base_url + href)
            # player_data now looks like this:
            # ['David Aardsma', '2004 - 2015', 'P', 'http://www.fangraphs.com/statss.aspx?playerid=1902&position=P']
            ind_player_urls.append(player_data)
            # build the links for game_log and split
            for year in years:
                split = '{}{}&season={}'.format(splits_url, href[12:], year)
                game_log = '{}{}&type=&gds=&gde=&season={}'.format(game_logs_url, href[12:], year)
                # checking if the player is a pitcher or not. We append both link and name (player_data[0]), so we don't need to extract the name later on
                if 'P' in player_data[2]:
                    pitchers_split.append([player_data[0], split])
                    pitchers_game_log.append([player_data[0], game_log])
                else:
                    pos_players_split.append([player_data[0], split])
                    pos_players_game_log.append([player_data[0], game_log])

    print("[*] Done extracting data for players for letter {} out of {}".format(current_letter_count, len(letter_links)))
    sleep(2)
    # CONSIDER INSERTING CSV-PART HERE....


# Extracting and writing pitcher data to file
with open('SplitsPitchingData2.csv', 'a') as fp:
    writer = csv.writer(fp)
    for i in pitchers_split:
        try:
            row_sp = []
            rows_sp = []
            # all elements in pitchers_split are lists: the player name is i[0], the URL is i[1]
            data = urlopen(i[1])
            soup = BeautifulSoup(data, "html.parser")
            # append the name to row_sp from pitchers_split
            row_sp.append(i[0])
            # the page has 3 tables with the class rgMasterTable: the first is Standard, the second Advanced, the third Batted Ball
            # we're only grabbing Standard
            table_standard = soup.find_all('table', {"class":"rgMasterTable"})[0]
            trs = table_standard.find_all('tr')
            for tr in trs:
                td = tr.find_all('td')
                for content in td:
                    row_sp.append(content.get_text())
            rows_sp.append(row_sp)
            writer.writerows(rows_sp)
            sleep(2)
        except Exception as e:
            print(e)

Since I don't know exactly how you want the output formatted, you'll need to do some work on the data yourself.

If you want to avoid waiting for all the letter_links to be fetched before retrieving the actual pitcher stats (and to fine-tune the output), you can move the csv-writer part up so it runs as part of the letter loop. If you do, don't forget to empty the pitchers_split list before fetching the next letter_link... A rough sketch of that restructuring follows.
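The sketch below reuses the names from the script above; pitcher_split_rows is a helper I'm introducing only for illustration, wrapping the same table extraction already shown:

import csv
from time import sleep
from urllib.request import urlopen
from bs4 import BeautifulSoup

def pitcher_split_rows(name, url):
    # Same extraction as above: first rgMasterTable on the splits page,
    # flattened into one row that starts with the player's name.
    soup = BeautifulSoup(urlopen(url), "html.parser")
    row = [name]
    table_standard = soup.find_all('table', {"class": "rgMasterTable"})[0]
    for tr in table_standard.find_all('tr'):
        for td in tr.find_all('td'):
            row.append(td.get_text())
    return [row]

# ...inside the letter loop, right after pitchers_split has been filled for that letter:
with open('SplitsPitchingData2.csv', 'a', newline='') as fp:
    writer = csv.writer(fp)
    for name, url in pitchers_split:
        try:
            writer.writerows(pitcher_split_rows(name, url))
        except Exception as e:
            print(e)
        sleep(2)
pitchers_split = []  # empty the list before the next letter_link is fetched

Appending per letter this way also means a crash halfway through doesn't cost you the rows you've already written.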