Python web crawler from thenewboston

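I recently watched thenewboston's video on writing a web crawler in Python. For some reason I get an SSLError. I tried to fix it with line 6 of the code, but no luck. Any idea why it throws the error? The code is copied verbatim from thenewboston.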

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    #requests.get('https://www.thenewboston.com/', verify = True)
    while page <= max_pages:
        # Build the URL of the current search results page
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")  # name the parser explicitly

        # Print the absolute URL of every item link on the page
        for link in soup.find_all('a', {'class': 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)

        page += 1

creepy_crawly(1)

Source

2014-11-24 Steven

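The SSL error is caused by the site's certificate. It is probably because the URL you are trying to crawl is 'https'. Try another site that is plain http. – Craicerjack 2014-11-24 19:24:02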

Possible duplicate of http://stackoverflow.com/q/10667960/783219 – Prusse 2014-11-24 19:46:30

Thanks Craicerjack! I tried it on a site that uses plain "http" and it worked! But how would I go about running a web crawler on a domain that uses "https"? – Steven 2014-11-24 20:10:12

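I made a web crawler using urllib. It can be faster, and it has no problem accessing https pages. One caveat is that it does not verify the server certificate, which makes it faster but more dangerous (vulnerable to MITM attacks). That applies to older Python releases; Python 2.7.9+ and 3.4.3+ verify HTTPS certificates by default. Below is a usage example of that lib: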

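from urllib.request import urlopen  # Python 3 home of the old urllib.urlopen

link = 'https://www.stackoverflow.com'
html = urlopen(link).read().decode('utf-8', errors='replace')  # decode bytes to str
print(html)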

Those 3 lines (plus the import) are all you need to grab the HTML from a page. Simple, isn't it?
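For the https case asked about in the comments, here is a minimal sketch that fetches a page with the standard library while keeping certificate verification switched on. The names `build_context` and `fetch_html`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration; they are not part of the original answer.

```python
import ssl
from urllib.request import urlopen


def build_context():
    """Return an SSL context that verifies certificates and hostnames."""
    # create_default_context() loads the system CA certificates and
    # enables both certificate and hostname checks.
    return ssl.create_default_context()


def fetch_html(url, timeout=10):
    """Download `url` and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout, context=build_context()) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html('https://www.stackoverflow.com')` should return the page as a string, or raise an error if the certificate cannot be verified, which is usually preferable to silently skipping the check.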

I also suggest using a regular expression on the HTML to grab the other links. An example (using the re library) would be:

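import re
from urllib.parse import urlparse

# origLink: URL of the page `html` came from (not defined in the original answer)
for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # find other URLs in the HTML
    path = url.split("#", 1)[0]  # strip the fragment
    # absolute links are kept; relative ones get the page's scheme and host prepended
    link = path if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(origLink)) + path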

Source

2016-11-29 06:19:42 ArthurG

Isn't the general rule that you shouldn't use regular expressions to parse HTML? – Steven 2016-12-05 18:00:55

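Regular expressions are considered slow in many languages, but that does not seem to be the case in Python. My web crawler is able to process 10 links per second, so unless you need something faster than that, regex will serve you well. Needless to say, regex is quite accurate. – ArthurG 2016-12-06 19:00:28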
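On the regex-versus-parser point raised in these comments, here is a rough sketch of the same link extraction done with the standard library's HTML parser instead of a regular expression. `LinkCollector` and `extract_links` are hypothetical names, and resolving relative links with `urljoin` is a swapped-in technique rather than what the answer does.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)  # resolve relative links
                self.links.append(urldefrag(absolute).url)  # drop the #fragment


def extract_links(html, base_url):
    """Return all link targets found in `html`, resolved against `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Unlike the regex, this also handles relative paths that do not start with a slash, because `urljoin` resolves them against the page URL.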
