
Looping through a list of values for multiple URL requests in Python

I am trying to pull several years of hourly data from multiple weather stations and put it into a pandas dataframe. I cannot use the API because of its request limits, and I do not want to pay thousands of dollars for this data.

I can get the data I need from the script. But when I try to modify it to loop through a list of stations, I either get a 406 error or it only returns data for the first station in the list. How can I loop through all the stations? And how can I store the station name so it can be added to the dataframe as another column?

Here is what my code looks like now:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

stations = ['EGMC', 'KSAT', 'CAHR']

weather_data = []
date = []
for s in stations:
    for y in range(2014, 2015):
        for m in range(1, 13):
            for d in range(1, 32):
                # check if a leap year
                if y % 400 == 0:
                    leap = True
                elif y % 100 == 0:
                    leap = False
                elif y % 4 == 0:
                    leap = True
                else:
                    leap = False

                # skip day numbers that do not exist in this month
                if m == 2 and leap and d > 29:
                    continue
                elif m == 2 and not leap and d > 28:
                    continue
                elif m in [4, 6, 9, 11] and d > 30:
                    continue

                timestamp = str(y) + str(m) + str(d)
                print('Getting data for ' + timestamp)

                # build the URL for this station and date; format the whole
                # string so the station code {0} is actually substituted
                url = ('http://www.wunderground.com/history/airport/{0}/'
                       '{1}/{2}/{3}/DailyHistory.html?HideSpecis=1').format(s, y, m, d)
                page = urlopen(url)

                # find the correct piece of data on the page
                soup = BeautifulSoup(page, 'lxml')

                for row in soup.select("table tr.no-metars"):
                    date.append(str(y) + '/' + str(m) + '/' + str(d))
                    cells = [cell.text.strip().encode('ascii', 'ignore').decode('ascii')
                             for cell in row.find_all('td')]
                    weather_data.append(cells)

weather_datadf = pd.DataFrame(weather_data) 
datedf = pd.DataFrame(date) 
result = pd.concat([datedf, weather_datadf], axis=1) 
result 

Answer


Here is an explanation of your error: https://httpstatuses.com/406

You should add a User-Agent header to the request. But I think this site has some crawl protection, so you should use something more robust, such as Scrapy, Crawlera, a proxy list, or a user-agent rotator.
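
As a minimal sketch of the first suggestion, using Python 3's urllib (which the question's code already uses); the User-Agent strings here are illustrative browser UAs I chose, not values the site is known to accept:

import random
from urllib.request import Request, urlopen

# Hypothetical User-Agent strings; swap in whatever browser UAs you like
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36',
]

url = ('http://www.wunderground.com/history/airport/EGMC/'
       '2014/1/1/DailyHistory.html?HideSpecis=1')

# Wrap the URL in a Request object so the custom header is sent with it;
# picking a random UA per request is a crude form of user-agent rotation
req = Request(url, headers={'User-Agent': random.choice(user_agents)})
page = urlopen(req)

If the site still returns 406 or blocks you after that, routing requests through proxies or moving to a crawling framework like Scrapy would be the next step.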