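Web scraper with HTTP Error 503: Service Unavailable

I'm trying to build a scraper, but I keep getting a 503 blocking error. I can still access the website manually, so my IP address hasn't been blocked. I keep switching user agents and still can't get my code to run all the way through. Sometimes I get as far as 15, sometimes I get nothing, but it always fails eventually. I have no doubt I'm doing something wrong in my code. I did trim it down, though, so please keep that in mind. How can I fix this without using third parties?
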
import requests
import urllib2
from urllib2 import urlopen
import random
from contextlib import closing
from bs4 import BeautifulSoup
import ssl
import parser
import time
from time import sleep

def Parser(urls):
    randomint = random.randint(0, 2)
    randomtime = random.randint(5, 30)
    url = "https://www.website.com"
    user_agents = [
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
        "Opera/9.80 (Windows NT 6.1; U; cs) Presto/2.2.15 Version/10.00"
    ]
    index = 0
    opener = urllib2.build_opener()
    req = opener.addheaders = [('User-agent', user_agents[randomint])]

def ReadUPC():
    UPCList = [
        'upc',
        'upc2',
        'upc3',
        'upc4',
        'etc.'
    ]
    extracted_data = []
    for i in UPCList:
        urls = "https://www.website.com" + i
        randomtime = random.randint(5, 30)
        Soup = BeautifulSoup(urlopen(urls), "lxml")
        price = Soup.find("span", {"class": "a-size-base a-color-price s-price a-text-bold"})
        sleep(randomtime)
        randomt = random.randint(5, 15)
        print "ref url:", urls
        sleep(randomt)
        print "Our price:", price
        sleep(randomtime)

if __name__ == "__main__":
    ReadUPC()
    index = index + 1
    sleep(10)

class HTTPDefaultErrorHandler(BaseHandler):
    def http_error_default(self, req, fp, code, msg, hdrs):
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 503: Service Unavailable

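One thing worth checking in the code as posted: the opener that carries the rotated User-agent is built inside Parser, which is never called, while ReadUPC fetches with a bare urlopen(urls). A bare urlopen sends the default "Python-urllib/x.y" user agent, so the rotation may never reach the server. Whether that is what triggers the 503 here is only a guess. Below is a minimal sketch of attaching the header to each request, assuming Python 3 (where urllib2 became urllib.request). The helper name build_request and the constant USER_AGENTS are hypothetical; the user agent strings and the URL are the ones from the question.

```python
import random
import urllib.request

# User agent strings copied from the question.
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
    "Opera/9.80 (Windows NT 6.1; U; cs) Presto/2.2.15 Version/10.00",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request that carries a randomly chosen User-Agent header.

    build_request is a hypothetical helper, not part of the original code.
    """
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(user_agents)}
    )


if __name__ == "__main__":
    req = build_request("https://www.website.com/upc")
    # Request stores header names capitalized, hence "User-agent" here.
    print(req.get_header("User-agent"))
    # To actually fetch the page, pass the Request (not the bare URL):
    #   html = urllib.request.urlopen(req, timeout=30).read()
```

Passing the Request object instead of the URL string is what makes the header take effect; the rest of the loop can stay as it is.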
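Your code is impossible to follow; why are you mixing libraries like this? –
I cut out some of the things I was trying. I apologize for the extras. – jstats
Why are you using pycurl, urllib2, requests and urllib? –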
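
On the point raised in the comments about mixing libraries: one standard-library module is enough for the fetch itself. Since the failures are intermittent, it may also help to retry a 503 after a growing delay instead of letting the first one end the run. The following is a rough sketch under those assumptions, written for Python 3; fetch_with_backoff and its parameters (max_tries, base_delay, opener, sleep) are hypothetical names, and nothing here is guaranteed to get past a site that is deliberately blocking scrapers.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(request, opener=urllib.request.urlopen,
                       max_tries=5, base_delay=5.0, sleep=time.sleep):
    """Fetch `request`, retrying on HTTP 503 with exponential backoff.

    `opener` and `sleep` are injectable so the retry logic can be
    exercised offline. All names here are illustrative assumptions.
    """
    for attempt in range(max_tries):
        try:
            return opener(request)
        except urllib.error.HTTPError as err:
            # Only 503 is retried; the final failed attempt is re-raised.
            if err.code != 503 or attempt == max_tries - 1:
                raise
            # Use the server's Retry-After hint when it is a plain
            # number of seconds; otherwise double the delay each time.
            hint = err.headers.get("Retry-After") if err.headers else None
            if hint and hint.isdigit():
                delay = float(hint)
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)


if __name__ == "__main__":
    req = urllib.request.Request(
        "https://www.website.com/upc",
        headers={"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; "
                               "Windows NT 6.1; Trident/6.0)"},
    )
    # Uncomment to perform a real request:
    # html = fetch_with_backoff(req).read()
```

With the defaults, the waits between attempts are 5, 10, 20 and 40 seconds, after which the last HTTPError propagates to the caller.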