Getting HTML content from an HTTPS website in Python

I want to fetch the HTML code of a website and write it to a file. This works fine for http sites, but when the link uses SSL I get a lot of errors. Any ideas on how to handle this?

from __future__ import print_function
import io
import os
import re
import ssl
from urllib.request import urlopen

with io.open('words.txt', 'a', encoding="utf-8") as g:
    url = "https://www.something.some"
    html = urlopen(url).read()
    print(html, file=g)

Here is the error:

Traceback (most recent call last):
  File "...\Desktop\mined.py", line 54, in <module>
    html = urlopen(url).read()
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "....\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 472, in open
    response = meth(req, response)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 510, in error
    return self._call_chain(*args)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Source

2016-10-25 Vedad

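*...then I get a lot of errors.* - It really helps if you include the errors in your question. Better still, search for the error string, because someone else has probably run into the same problem and may already have solved it.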

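I have updated the question with the error. Yes, I tried searching for it, but in most cases the answers did not do what I want: they only check the status of the page, whereas I want the HTML content. – Vedad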

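When you say *it works fine with http sites*, are you trying to scrape the same website? (Meaning the only difference is that "http://www.something.some" works and "https://www.something.some" does not.) Or are they different websites (different URLs)? A '403' HTTP status code means you are not allowed to view the content, which usually means you did not supply a username/password correctly, but that should happen for both the http and the HTTPS call. – BorrajaX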

I would do it like this:

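from urllib.request import urlopen  # Python 3; plain urllib.urlopen exists only in Python 2

resp = urlopen('https://somewebsite.com')    # open url
page = resp.read()                           # copy website source (bytes) to 'page' variable
with open("Output.txt", "wb") as text_file:  # open file in binary mode, since 'page' is bytes
    text_file.write(page)                    # insert website source into the file
# the 'with' block closes the file automatically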

Source

2016-10-25 13:57:52 Eilat

urllib.error.HTTPError: HTTP Error 403: Forbidden

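Error 403 Forbidden means that you established a successful SSL connection to the site, but the web server explicitly refused to serve you the content. The server may not want you to access the site over https, and there is a good chance you would get the same error when visiting the same URL in a browser. It is also possible that the server is not configured correctly for https.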
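To make the distinction above concrete, here is a minimal sketch of how a script could tell an HTTP-level refusal (such as 403) apart from a failure at the TLS or connection level. The helper names `describe_failure` and `try_fetch` are hypothetical and chosen for illustration only; the sketch assumes Python 3 and uses only the standard library.

```python
import io
import ssl
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_failure(exc):
    """Return a short label saying at which layer a urlopen() call failed."""
    # HTTPError must be tested first because it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        # The TLS handshake succeeded and the server answered with a status code.
        return "http status {}".format(exc.code)
    if isinstance(exc, URLError) and isinstance(exc.reason, ssl.SSLError):
        # The TLS handshake itself failed (for example, a certificate problem).
        return "tls failure"
    if isinstance(exc, URLError):
        # DNS failure, refused connection, and similar problems.
        return "connection failure"
    return "other"


def try_fetch(url):
    """Fetch url; return (body_bytes, None) on success or (None, label) on failure."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read(), None
    except URLError as exc:
        return None, describe_failure(exc)


# Offline demonstration with a hand-built error, so no network access is needed.
fake_403 = HTTPError("https://www.something.some", 403, "Forbidden", {}, io.BytesIO())
print(describe_failure(fake_403))
```

If `try_fetch` reports an HTTP status rather than a TLS failure, the SSL part of the request worked and the problem lies in what the server decided to do with the request.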

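If you can access exactly the same URL with a browser but not with your script, the server is probably filtering on the User-Agent or something else (for example, to block non-browser clients). In that case it would be useful to know the real URL of the site in order to help you better.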
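If the server does filter on the User-Agent, one thing to try is sending a different header than the default one (`urllib` identifies itself as `Python-urllib/3.x`). The following is a sketch under that assumption, not a guaranteed fix: the User-Agent string, the helper names `build_request` and `save_html`, and the output path are all made up for illustration, and a site that filters on something else will still answer 403. Check the site's terms of use before scraping it.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent value; any string the server accepts would do.
BROWSER_LIKE_UA = "Mozilla/5.0 (compatible; example-fetcher/0.1)"


def build_request(url, user_agent=BROWSER_LIKE_UA):
    """Build a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def save_html(url, path):
    """Download url and write its HTML to path as UTF-8 text."""
    with urlopen(build_request(url), timeout=10) as resp:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    with open(path, "w", encoding="utf-8") as out:
        out.write(html)
    return len(html)
```

Calling `save_html("https://www.something.some", "words.txt")` would then perform the download; decoding the bytes before writing avoids ending up with a `b'...'` representation in the file.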

Source

2016-10-25 14:17:11
