Preventing 503 errors when scraping Google Scholar

I wrote the code below to scrape data from Google Scholar's security tag page. But whenever I run it I get this error:

Traceback (most recent call last): 
    File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 53, in <module> 
    getProfileFromTag(each) 
    File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 32, in getProfileFromTag 
    page = urllib.request.urlopen(url) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 163, in urlopen 
    return opener.open(url, data, timeout) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open 
    response = meth(req, response) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 504, in error 
    result = self._call_chain(*args) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain 
    result = func(*args) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 696, in http_error_302 
    return self.parent.open(new, timeout=req.timeout) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open 
    response = meth(req, response) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 510, in error 
    return self._call_chain(*args) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain 
    result = func(*args) 
    File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 590, in http_error_default 
    raise HTTPError(req.full_url, code, msg, hdrs, fp) 
urllib.error.HTTPError: HTTP Error 503: Service Unavailable 

I think this is because Google Scholar is blocking my requests. How can I prevent that?

The code is:

# -*- coding: utf-8 -*- 
from bs4 import BeautifulSoup 
import urllib.request 
import string 
import csv 
import time 

# Declare lists to store the scraped data
name = [] 
urlList =[] 

# Open the CSV file and write its header row
outputFile = open('sample.csv', 'w', newline='') 
outputWriter = csv.writer(outputFile) 
outputWriter.writerow(['Name', 'URL', 'Total Citations', 'h-index', 'i10-index']) 

def getStat(url):
    # Given an author's URL, return the citation stats from the profile
    # sidebar (citations, h-index, i10-index, plus their "since" variants).
    url = 'https://scholar.google.pl' + url
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    cells = soup.findAll("td", {"class": "gsc_rsb_std"})
    # Return the cell texts rather than an empty list (and avoid shadowing list)
    return [cell.text for cell in cells]

def getProfileFromTag(tag):
    url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:" + tag
    while True:
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, 'lxml')

        # Collect the author names and profile links on this results page
        # (reuse the page already fetched instead of requesting it twice)
        mydivs = soup.findAll("h3", {"class": "gsc_1usr_name"})
        for each in mydivs:
            for anchor in each.find_all('a'):
                name.append(anchor.text)
                urlList.append(anchor['href'])
                time.sleep(0.001)

        # Follow the "next page" button ("Następna" on the Polish site)
        buttons = soup.findAll("button", {"aria-label": "Następna"})
        if not buttons:
            break
        on_click = buttons[0].get('onclick')
        url = 'http://scholar.google.pl' + on_click[17:-1]
        url = url.encode('utf-8').decode('unicode_escape')

    # Write one CSV row per author: name, URL, citations, h-index, i10-index
    for i, each in enumerate(name):
        stats = getStat(urlList[i])
        outputWriter.writerow([each, urlList[i], stats[0], stats[2], stats[4]])

tags = ['security'] 
for each in tags: 
    getProfileFromTag(each) 

Please simplify your code sample (parts of it are awkward). And provide a stack trace. Print the computed URL before opening it so you can debug. I'm sure you will find the error yourself. –


You could try setting the 'referer' field in the request headers. That has worked for me on some sites. https://en.wikipedia.org/wiki/HTTP_referer – heltonbiker
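
A minimal sketch of that suggestion, using the question's urllib setup; the Referer and User-Agent values here are illustrative assumptions, not a confirmed fix:

import urllib.request

url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security'
# Attach a Referer (and a browser-like User-Agent) to the request
req = urllib.request.Request(url, headers={
    'Referer': 'https://scholar.google.pl/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
})
page = urllib.request.urlopen(req)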


@LaurentLAPORTE I have already done that, but I still can't find the error. – user7340814

Answer


Use requests together with the appropriate request headers instead.

import requests 

url = 'https://scholar.google.pl/citations?view_op=search_authors&mauthors=label:security' 

request_headers = { 
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 
    'accept-encoding': 'gzip, deflate, br', 
    'accept-language': 'en-US,en;q=0.8', 
    'upgrade-insecure-requests': '1', 
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' 
} 

with requests.Session() as s: 
    r = s.get(url, headers=request_headers) 

The results you get:

Adrian Perrig /citations?user=n-Oret4AAAAJ&hl=pl 
Vern Paxson  /citations?user=HvwPRJ0AAAAJ&hl=pl 
Frans Kaashoek /citations?user=YCoLskoAAAAJ&hl=pl 
Mihir Bellare /citations?user=2pW1g5IAAAAJ&hl=pl 
Xuemin Shen  /citations?user=Bjl3GwoAAAAJ&hl=pl 
Helen J. Wang /citations?user=qhu-DxwAAAAJ&hl=pl 
Sushil Jajodia /citations?user=lOZ1vHIAAAAJ&hl=pl 
Martin Abadi  /citations?user=vWTI60AAAAAJ&hl=pl 
Jean-Pierre Hubaux /citations?user=W7YBLlEAAAAJ&hl=pl 
Ross Anderson /citations?user=WgyDcoUAAAAJ&hl=pl 

Then, to pull each author's name and profile link out of the response, use this:

from bs4 import BeautifulSoup

# Parse the page fetched above and extract each author's name and link
soup = BeautifulSoup(r.text, 'lxml')
users = soup.findAll('h3', {'class': 'gsc_oai_name'})
for user in users:
    name = user.a.text.strip()
    link = user.a['href']
    print(name, '\t', link)

You can find out which headers your browser sends by inspecting the Network tab in Chrome's developer tools.
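
Even with browser-like headers, Scholar can still throttle rapid scraping, so it may help to back off and retry when a 503 does come back. A minimal sketch under that assumption; fetch is a hypothetical helper and the delay values are illustrative:

import time
import requests

def fetch(session, url, headers, retries=3):
    # Retry with an increasing pause whenever the server answers 503
    for attempt in range(retries):
        r = session.get(url, headers=headers)
        if r.status_code != 503:
            return r
        time.sleep(5 * (attempt + 1))  # back off: 5 s, 10 s, 15 s
    r.raise_for_status()  # give up and raise the final 503

with requests.Session() as s:
    r = fetch(s, url, request_headers)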