How to scrape with Python and bypass errors

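I have about 3000 URLs; some of them work and some of them don't. I tried running Beautiful Soup on them, but I got several different errors, which confused me: I don't know what kind of try/except block I should put in my code. What I want is to ignore every URL that returns an internal server error, use only the ones that load without errors, and extract the text as written in the code below.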

My code:

import urllib2
from bs4 import BeautifulSoup

mega = [[]]  # list in a list
for i in range(len(soc)):  # soc holds dictionaries with multiple keys
    myurl = soc[i]['the_urls']
    html = urllib2.urlopen(myurl).read()
    soup = BeautifulSoup(html, "html.parser")
    row = soup.findAll('tr')
    for r in row:
        mega.append([r.get_text()])  # scrape all the texts

Error:

Traceback (most recent call last): 
    File "<stdin>", line 3, in <module> 
    File "/Users/name/anaconda/lib/python2.7/urllib2.py", line 154, in urlopen 
    return opener.open(url, data, timeout) 
    File "/Users/name/anaconda/lib/python2.7/urllib2.py", line 435, in open 
    response = meth(req, response) 
    File "/Users/name/anaconda/lib/python2.7/urllib2.py", line 548, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/Users/name/anaconda/lib/python2.7/urllib2.py", line 473, in error 
    return self._call_chain(*args) 
    File "/Users/name/anaconda/lib/python2.7/urllib2.py", line 407, in _call_chain 
    result = func(*args) 
    File "/Users/name/anaconda/lib/python2.7/urllib2.py", line 556, in http_error_default 
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) 
urllib2.HTTPError: HTTP Error 500: Internal Server Error

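Does the error mean that all the urls have the same problem - internal server error? In that case, I think one thing I could do is add a try and except block that says: if there is no HTTP Error 500, go ahead with the scrape; if there is, pass.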

Edit:

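I tried to bypass the error with the code below. I don't know whether it works, in particular whether `pass` or `continue` does the right thing: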

for i in range(len(soc)):
    myurl = soc[i]['report_url']
    while True:
        try:
            html = urllib2.urlopen(myurl).read()
            break
        except urllib2.HTTPError:
            continue
    soup = BeautifulSoup(html, "html.parser")
    row = soup.findAll('tr')
    for r in row:
        mega.append([r.get_text()])  # scrape the text

Source

2017-04-23 song0089

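No, it actually means that one URL returned a 500 error. You should handle it with `try`/`except`.

The code in your comment above is unreadable.

Your edit will loop until the call succeeds. Possibly forever.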

Does the error mean that all the urls have the same problem - internal server error?

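Not always. A 5XX server error can have causes such as:

- The infrastructure has failed, so every page on the site is unreachable
- Something goes wrong on one specific page, for example a DivisionByZero while computing the data for that page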

Use try/except to handle the problem and move on to the next URL.
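A minimal sketch of that advice, not the asker's final code. It assumes Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`. The names `fetch_html` and `scrape_all`, the `parse` and `fetch` parameters, and the 10-second timeout are hypothetical, introduced here so the skipping logic can be tried without network access.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download one page; raises HTTPError or another OSError on failure."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_all(urls, parse, fetch=fetch_html):
    """Parse every URL that can be fetched and skip the ones that fail.

    Returns (rows, failed): the parsed rows, and (url, reason) pairs
    for the URLs that were skipped.
    """
    rows, failed = [], []
    for url in urls:
        try:
            html = fetch(url)
        except HTTPError as err:
            # The server answered with an error status such as 500 or 404.
            failed.append((url, "HTTP %d" % err.code))
            continue  # go on to the next URL
        except OSError as err:
            # URLError is an OSError subclass, so this also covers DNS
            # failures, refused connections and socket timeouts.
            failed.append((url, str(err)))
            continue
        rows.extend(parse(html))
    return rows, failed
```

With `bs4` installed, the row extraction from the question could be passed in as `parse`, for example `lambda html: [r.get_text() for r in BeautifulSoup(html, "html.parser").find_all("tr")]`. Note that `continue` here sits directly in the `for` loop, so it moves to the next URL instead of retrying the same one.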

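Also, you could keep some kind of statistics: if you see a lot of errors on a particular domain, assume it is not working right now and move its URLs to the bottom of the list.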
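One way the per-domain statistics might look, as a sketch only. The function name `deprioritize_failing_domains` and the `threshold` of 3 errors are assumptions made for illustration; the answer does not say how many errors count as "a lot".

```python
from collections import Counter
from urllib.parse import urlparse


def deprioritize_failing_domains(urls, failed_urls, threshold=3):
    """Move the URLs of error-prone domains to the end of the list.

    A domain counts as error-prone once at least `threshold` of its URLs
    appear in `failed_urls`. Relative order is otherwise preserved.
    """
    errors_per_domain = Counter(urlparse(u).netloc for u in failed_urls)
    flaky = {d for d, n in errors_per_domain.items() if n >= threshold}
    healthy_first = [u for u in urls if urlparse(u).netloc not in flaky]
    flaky_last = [u for u in urls if urlparse(u).netloc in flaky]
    return healthy_first + flaky_last
```

The list of failed URLs would come from whatever the scraping loop records in its `except` branch; the function only reorders, it never drops a URL.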

Source

2017-04-23 05:14:35

Hi, I get the idea of using try and except, but I'm not sure whether my code works. I made some edits; could you comment? Thanks. – song0089

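If one of the URLs is permanently unreachable, you may get stuck on it. You could try another way: go through the whole URL list, and if a URL doesn't work, move it to the end of the list and go on to the next one. Repeat the whole process a few times == 'MAX_ATTEMPS'. Also, if you plan to parse a lot of data, take a look at https://scrapy.org/. It lets you write complex scrapers very quickly. For example, https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb parses the weblancer site with 16 threads and returns a 4k-line CSV in a few seconds.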
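The requeue idea from this comment could be sketched roughly as follows. This is one reading of the suggestion, assuming Python 3; `fetch_with_requeue`, `fetch_html`, the `fetch` parameter and the value 3 for the attempt limit are hypothetical (the comment names the limit 'MAX_ATTEMPS' but gives no value).

```python
from collections import deque
from urllib.request import urlopen

MAX_ATTEMPTS = 3  # assumed value; the comment only names the limit


def fetch_html(url, timeout=10):
    """Download one page; raises an OSError subclass on failure."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_requeue(urls, fetch=fetch_html, max_attempts=MAX_ATTEMPTS):
    """Fetch each URL, moving failures to the back of the queue.

    A URL is dropped after `max_attempts` failures, so the loop always
    terminates. Returns (pages, gave_up): a dict mapping url -> html,
    and the list of URLs that never succeeded.
    """
    queue = deque((url, 0) for url in urls)
    pages, gave_up = {}, []
    while queue:
        url, attempts = queue.popleft()
        try:
            pages[url] = fetch(url)
        except OSError:  # HTTPError and URLError are OSError subclasses
            attempts += 1
            if attempts < max_attempts:
                queue.append((url, attempts))  # retry after the others
            else:
                gave_up.append(url)
    return pages, gave_up
```

Unlike the `while True` loop in the question's edit, a permanently failing URL costs at most `max_attempts` requests and never blocks the URLs behind it.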
