
I have some code that uses newspaper to look at various media outlets and download articles from them. This has been working fine, but it has recently started acting up. I can see where the problem is, but since I'm new to Python I'm not sure of the best way to deal with it. Basically (I think) I need to add something so that the occasional malformed web address doesn't crash the whole run, and instead lets the script skip that URL and move on to the rest. In short: how should I handle article exceptions in newspaper?

The error originates when I try to download the articles using:

article.download() 

Some articles (which ones changes from day to day, obviously) raise the following error, but the script keeps running:

Traceback (most recent call last):
  File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
    raise UnicodeError("label too long")
UnicodeError: label too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
    func(*args, **kargs)
  File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
    html = network.get_html(url, config=self.config)
  File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html
    return get_html_2XX_only(url, config, response)
  File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only
    url=url, **get_request_kwargs(timeout, useragent))
  File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send
    r = adapter.send(request, **kwargs)
  File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Anaconda3\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output
    self.send(msg)
  File "C:\Anaconda3\lib\http\client.py", line 877, in send
    self.connect()
  File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect
    conn = self._new_conn()
  File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)

The script should then parse each article, run natural-language processing on it, and write certain elements to a dataframe, so next I have:

for paper in papers:
    for article in paper.articles:
        article.parse()
        print(article.title)
        article.nlp()
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title, article.summary, article.keywords, article.url]
        i += 1

(This is probably a bit sloppy too, but that's a problem for another day.)

This runs fine until it gets to one of the articles with a bad URL, at which point it throws an ArticleException and the script crashes:

C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0. 
    warnings.warn(str(msg)) 

    ArticleException                          Traceback (most recent call last)
    <ipython-input-17-2106485c4bbb> in <module>()
      4 for paper in papers: 
      5  for article in paper.articles: 
    ----> 6   article.parse() 
      7   print(article.title) 
      8   article.nlp() 

    C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self) 
     183 
     184  def parse(self): 
    --> 185   self.throw_if_not_downloaded_verbose() 
     186 
     187   self.doc = self.config.get_parser().fromstring(self.html) 

    C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self) 
     519   if self.download_state == ArticleDownloadState.NOT_STARTED: 
     520    print('You must `download()` an article first!') 
    --> 521    raise ArticleException() 
     522   elif self.download_state == ArticleDownloadState.FAILED_RESPONSE: 
     523    print('Article `download()` failed with %s on URL %s' % 

    ArticleException: 

So what is the best way to keep this from terminating my script? Should I address it at the download stage, where I get the Unicode error, or at the parse stage, by telling it to ignore those bad addresses? And how would I go about implementing that fix?
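
For illustration, a rough sketch of what handling it at the download stage might look like (untested; it assumes papers is the same list of newspaper Source objects used above, and downloads each article individually instead of through the threaded pool that appears in the traceback):

# Note: depending on the newspaper version, download() may swallow network errors
# itself and only record a failed article.download_state instead of raising.
for paper in papers:
    for article in paper.articles:
        try:
            article.download()
        except Exception as exc:  # e.g. the UnicodeError for an over-long IDNA label
            print('Skipping %s (%s)' % (article.url, exc))

Even then, any article whose download failed would presumably still raise ArticleException when parse() is called on it later, so the parse loop would need to skip those as well.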

I'd really appreciate any advice.

Answer


I had the same problem, although in a slightly different context. Even though a bare try/except like this is generally not recommended, the following worked for me:

try:
    a.parse()
    file.write(a.title + '\n')
except:
    pass
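
A slightly tighter variant (just a sketch, reusing papers, stories and i from the loop in the question) is to catch newspaper's own ArticleException, which the traceback shows is raised from newspaper\article.py, so that unrelated bugs still surface:

from datetime import datetime
from newspaper.article import ArticleException

for paper in papers:
    for article in paper.articles:
        try:
            article.parse()
            article.nlp()
        except ArticleException:
            # raised when the earlier download() never succeeded; skip this article
            continue
        d = article.publish_date.date() if article.publish_date else datetime.now().date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title,
                          article.summary, article.keywords, article.url]
        i += 1

That way a genuinely broken article is dropped, while any other kind of failure still produces a visible traceback.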