RobotParser raises an SSL CERTIFICATE_VERIFY_FAILED exception

I am writing a simple web crawler in Python 2.7, and I get an SSL certificate verification failure when I try to retrieve the robots.txt file from an HTTPS site.

Here is the relevant code:

import re
import urlparse
import robotparser
import requests

robotfiledictionary = {} #module-level cache of RobotFileParser objects, keyed by domain

def getHTMLpage(pagelink, currenttime):
    "Downloads HTML page from server"
    #parse URL and get domain name
    o = urlparse.urlparse(pagelink, "http")
    if o.netloc == "":
        netloc = re.search(r"[^/]+\.[^/]+\.[^/]+", o.path)
        if netloc:
            domainname = "http://" + netloc.group(0) + "/"
    else:
        domainname = o.scheme + "://" + o.netloc + "/"
    if o.netloc != "" and o.netloc is not None and o.scheme != "mailto": #if netloc isn't empty and it's not a mailto link
        link = domainname + o.path[1:] + o.params + "?" + o.query + "#" + o.fragment
        if not robotfiledictionary.get(domainname): #if robots.txt for domainname was not downloaded yet
            robotfiledictionary[domainname] = robotparser.RobotFileParser() #initialize robots.txt parser
            robotfiledictionary[domainname].set_url(domainname + "robots.txt") #set url for robots.txt
            print " Robots.txt for %s initial download" % str(domainname)
            robotfiledictionary[domainname].read() #download/read robots.txt
        else: #robots.txt for domainname was already downloaded
            if (currenttime - robotfiledictionary[domainname].mtime()) > 3600: #if robots.txt is older than 1 hour
                robotfiledictionary[domainname].read() #download/read robots.txt again
                print " Robots.txt for %s downloaded" % str(domainname)
                robotfiledictionary[domainname].modified() #update time
        if robotfiledictionary[domainname].can_fetch("WebCrawlerUserAgent", link): #if access is allowed...
            #fetch page
            print link
            page = requests.get(link, verify=False)
            return page.text #Response.text is a property, not a method
        else: #otherwise, report
            print " URL disallowed due to robots.txt from %s" % str(domainname)
            return "URL disallowed due to robots.txt"
    else: #if netloc was empty, URL wasn't parsed. report
        print "URL not parsed: %s" % str(pagelink)
        return "URL not parsed"

And here is the exception I am getting:

 Robots.txt for https://ehi-siegel.de/ initial download
Traceback (most recent call last):
  File "C:\webcrawler.py", line 561, in <module>
    HTMLpage = getHTMLpage(link, loopstarttime)
  File "C:\webcrawler.py", line 122, in getHTMLpage
    robotfiledictionary[domainname].read() #download/read robots.txt
  File "C:\Python27\lib\robotparser.py", line 58, in read
    f = opener.open(self.url)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 1053, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 897, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 859, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1278, in connect
    server_hostname=server_hostname)
  File "C:\Python27\lib\ssl.py", line 353, in wrap_socket
    _context=self)
  File "C:\Python27\lib\ssl.py", line 601, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 830, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

As you can see, I have already changed the code at the end to ignore SSL certificates when retrieving the page itself (I know this is frowned upon in production, but I wanted to test it), but now it seems that the robotparser.read() call fails SSL verification. I have seen that I could download the certificate manually and point the program at it to verify the SSL certificate, but ideally I would like my program to work "out of the box", since I will not be the one using it. Does anyone know what to do?
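
For the page download itself, pointing requests at a CA bundle is usually preferable to verify=False; a minimal sketch, assuming the certifi package is available (the URL is just the site from the traceback):

import certifi
import requests

#pass a CA bundle path to `verify` instead of disabling verification entirely
page = requests.get("https://ehi-siegel.de/", verify=certifi.where())
print page.status_code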

EDIT: I went into robotparser.py. I added

import requests 

and changed line 58 to

f = requests.get(self.url, verify=False) 

and that seems to have fixed it. This is still not ideal, though, so I am still open to suggestions on how to do this properly.
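
One way to get the same override without editing the standard library file in place is to subclass RobotFileParser inside the crawler; a minimal sketch, assuming requests (with its bundled certifi CA store) is installed, and with RequestsRobotFileParser being a made-up name:

import robotparser
import requests

class RequestsRobotFileParser(robotparser.RobotFileParser):
    "RobotFileParser that fetches robots.txt with requests instead of urllib"
    def read(self):
        try:
            response = requests.get(self.url) #certificate verification stays enabled
        except requests.exceptions.RequestException:
            self.allow_all = True #treat an unreachable robots.txt as allow-all
            return
        if response.status_code in (401, 403):
            self.disallow_all = True
        elif 400 <= response.status_code < 500:
            self.allow_all = True
        elif response.status_code == 200:
            self.parse(response.text.splitlines())

The crawler could then store RequestsRobotFileParser() objects in robotfiledictionary instead of robotparser.RobotFileParser() and leave the rest of the logic unchanged.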

Answer


I found the solution myself. Using urllib3's request functionality, I was able to verify every site's certificate and continue accessing them.

I still had to edit the robotparser.py file. This is what I added at the beginning:

import urllib3 
import urllib3.contrib.pyopenssl 
import certifi 
urllib3.contrib.pyopenssl.inject_into_urllib3() 
http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where()) 
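
A quick way to check that this PoolManager verifies certificates as intended is to request the robots.txt that previously failed; a small sketch, assuming the lines above have been executed (the URL is taken from the traceback):

r = http.request('GET', 'https://ehi-siegel.de/robots.txt')
print r.status      #expect 200 if the certificate chain validates
print r.data[:100]  #beginning of the robots.txt body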

And this is the new definition of read(self):

def read(self):
    """Reads the robots.txt URL and feeds it to the parser."""
    f = http.request('GET', self.url) #fetch robots.txt through the verified PoolManager
    lines = [line.strip() for line in f.data.splitlines()] #split the body into lines, not single characters
    self.errcode = f.status #HTTP status of the urllib3 response
    if self.errcode in (401, 403):
        self.disallow_all = True
    elif self.errcode >= 400 and self.errcode < 500:
        self.allow_all = True
    elif self.errcode == 200 and lines:
        self.parse(lines)
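
With that change in place, the parser is used exactly as before; a short usage sketch (the user agent string matches the one used in the crawler):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://ehi-siegel.de/robots.txt")
rp.read() #now fetched through the verified urllib3 PoolManager
print rp.can_fetch("WebCrawlerUserAgent", "https://ehi-siegel.de/")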

I also used the same approach for the actual page requests in my program's function:

def getHTMLpage(pagelink, currenttime):
    "Downloads HTML page from server"
    #robotfiledictionary, http (the urllib3 PoolManager) and driver (a Selenium webdriver) are module-level globals
    #parse URL and get domain name
    o = urlparse.urlparse(pagelink, u"http")
    if o.netloc == u"":
        netloc = re.search(ur"[^/]+\.[^/]+\.[^/]+", o.path)
        if netloc:
            domainname = u"http://" + netloc.group(0) + u"/"
    else:
        domainname = o.scheme + u"://" + o.netloc + u"/"
    if o.netloc != u"" and o.netloc is not None and o.scheme != u"mailto": #if netloc isn't empty and it's not a mailto link
        link = domainname + o.path[1:] + o.params + u"?" + o.query + u"#" + o.fragment
        if not robotfiledictionary.get(domainname): #if robots.txt for domainname was not downloaded yet
            robotfiledictionary[domainname] = robotparser.RobotFileParser() #initialize robots.txt parser
            robotfiledictionary[domainname].set_url(domainname + u"robots.txt") #set url for robots.txt
            print u" Robots.txt for %s initial download" % str(domainname)
            robotfiledictionary[domainname].read() #download/read robots.txt
        else: #robots.txt for domainname was already downloaded
            if (currenttime - robotfiledictionary[domainname].mtime()) > 3600: #if robots.txt is older than 1 hour
                robotfiledictionary[domainname].read() #download/read robots.txt again
                print u" Robots.txt for %s downloaded" % str(domainname)
                robotfiledictionary[domainname].modified() #update time
        if robotfiledictionary[domainname].can_fetch("WebCrawlerUserAgent", link.encode('utf-8')): #if access is allowed...
            #fetch page
            if domainname == u"https://www.otto.de/" or domainname == u"http://www.otto.de/":
                driver.get(link.encode('utf-8'))
                time.sleep(5)
                page = driver.page_source
                return page
            else:
                page = http.request('GET', link.encode('utf-8'))
                return page.data.decode('UTF-8', 'ignore')
        else: #otherwise, report
            print u" URL disallowed due to robots.txt from %s" % str(domainname)
            return u"URL disallowed due to robots.txt"
    else: #if netloc was empty, URL wasn't parsed. report
        print u"URL not parsed: %s" % str(pagelink)
        return u"URL not parsed"
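
For completeness, a call into this function from the crawler's main loop might look roughly like this; a sketch only, since robotfiledictionary, the urllib3 PoolManager http, and the Selenium driver have to be set up as described above:

import time

HTMLpage = getHTMLpage(u"https://ehi-siegel.de/", time.time())
print HTMLpage[:200] #start of the page text, or one of the error strings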

You will also notice that I changed my program to use strict UTF-8 throughout, but that is unrelated.
