
Python: urllib.error.HTTPError: HTTP Error 404: Not Found

I wrote a script to find typos in the titles of SO questions, and I have been using it for about a month. It was working fine.

But now, when I try to run it, this is what I get.

Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Here is my code:

import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker

my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})


print("Please Wait.. it will take some time") 
for i in range(298314,298346): 
    find_bad_qn(i) 
for qn in result: 
    qn['link'] = "https://stackoverflow.com" + qn['link'] 
for qn in result: 
    print(qn['link'], " Error Words:", qn['words']) 
    url = qn['link'] 

UPDATE

This is the URL causing the problem, even though the URL exists.

https://stackoverflow.com/questions?page=298314&sort=active

I tried changing the range to lower values, and then it works fine.

Why does this happen with the URL above?


Could you print the URL that is actually being requested? – LoicM


This one: https://stackoverflow.com/questions?page=298314&sort=active – jophab


That is actually strange; I can reproduce exactly the same problem for every page above roughly 270000. The page exists, but I get an error when requesting it with Python – LoicM

Answers


Apparently the default number of questions displayed per page is 50, so the range you defined in the loop goes beyond the number of pages available at 50 questions per page. The range should be adjusted to stay within the total number of pages (total questions divided by 50).
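As a rough back-of-the-envelope check of that reasoning (the total below is a made-up placeholder; the real figure would have to be read from the site), the highest page number you could safely request would be:

import math

# Assumption: 50 questions per listing page; total_questions is a
# placeholder, not a value read from Stack Overflow.
total_questions = 14000000
questions_per_page = 50

last_valid_page = math.ceil(total_questions / questions_per_page)
print(last_valid_page)  # pages beyond this would return 404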

This code will catch the 404 error (which is what you are getting) and ignore it, in case you go out of range again.

from urllib.error import HTTPError
from urllib.request import urlopen

def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError:
        # page number is out of range, so the request comes back 404; skip it
        pass

print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
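For completeness, here is a minimal sketch of how that guard could be folded back into the original find_bad_qn, so that out-of-range pages are skipped while pages that do exist are still parsed and spell-checked. It assumes the chkr and result globals from the question are already defined:

from urllib.error import HTTPError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        html = urlopen(url)
    except HTTPError:
        # page number out of range (or otherwise missing): skip this page
        return
    bsObj = BeautifulSoup(html, "html5lib")
    for div in bsObj.find_all("div", class_="question-summary"):
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())          # chkr: SpellChecker from the question
        words = [err.word for err in chkr]   # collect the misspelled words
        if len(words) > 1:
            result.append({'link': link, 'name': name, 'words': ' '.join(words)})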

But that URL exists. – jophab


No, it returns a 404 error code, which means the URL was not found. That is your error: urllib.error.HTTPError: HTTP Error 404: Not Found – Atirag


Yes, but the URL exists. You can try it yourself. My range values are not question IDs; they are page numbers of the active questions. – jophab