Python - Web scraping with BeautifulSoup and urllib

I am using Python 3.4 and my script looks like this:

import urllib 
from urllib.request import Request, urlopen 
from urllib.error import URLError, HTTPError 
from bs4 import BeautifulSoup 

url = "http://www.embassy-worldwide.com/" 

headers={'User-Agent': 'Mozilla/5.0'} 
#req = Request(url, headers) 

try: 
    req = urllib.request.Request(url, headers) 
    #print (req) 
except HTTPError as e: 
    print('Error code: ', e.code) 
except URLError as e: 
    print('Reason: ', e.reason) 
else: 
    print('good!') 

print (req) 

#html = urllib.request.urlopen(req) 
with urllib.request.urlopen(req) as response: 
    html = response.read() 
print(html)

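The code above results in this error:

ValueError: Content-Length should be specified for iterable data {'User-Agent': 'Mozilla/5.0'}

How can I fetch the HTML and then iterate over the tags to get a list of all the countries?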

Source

2016-03-05 Alg_D

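Please use 'urllib3'. – 2016-03-05 12:47:34

What is wrong with urllib? Could you give an example as a solution? –

'urllib' has many known flaws that were fixed in 'urllib2' and 'urllib3' (and in 'requests', which is built on 'urllib3'). urllib can fail at random without any useful indication (especially under high load). Also, it is common practice in the community to use the latest version of a library, so that problems in older versions may simply go away. – 2016-03-05 12:53:26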
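On the ValueError itself: the second positional parameter of urllib.request.Request is data (the request body), so in the question's script the headers dict is treated as a body to upload, which appears to be what triggers the Content-Length complaint. Below is a minimal sketch of a urllib-only fix that passes the headers by keyword. The helper names build_request and fetch_html are made up for illustration, and the error handling is moved around urlopen on the assumption that this is where the network errors are raised.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def build_request(url, user_agent="Mozilla/5.0"):
    # Hypothetical helper. Request's second positional parameter is `data`
    # (the request body), so the headers dict is passed by keyword instead.
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the response body as bytes, or None if the request fails."""
    req = build_request(url)
    try:
        # The network call happens in urlopen, so the errors are caught here.
        with urlopen(req) as response:
            return response.read()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be handled first.
        print("Error code:", e.code)
    except URLError as e:
        print("Reason:", e.reason)
    return None
```

Calling fetch_html("http://www.embassy-worldwide.com/") should then return the page as bytes, provided the site accepts the request; the bytes can be handed to BeautifulSoup directly.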

Try this style with urllib3:

import sys 
import re 
import time 
import pprint 
import codecs 
import unicodedata 
import urllib3 
import json 

urllib3.disable_warnings() 

cookie = '_session_id=29913b5f1b8836d2a8387ef4db00745e' 
header = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26.17 (KHTML, like Gecko) Version/6.0.2 Safari/536.26.17' 
url = 'https://yoururl.com/' 
m = urllib3.PoolManager(num_pools = 15) 

# Pass the headers by keyword so the call does not depend on positional parameter order.
r = m.request('GET', url, headers={'User-Agent': header, 'Cookie': cookie})

print(r.data)

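There are more imports than needed. This is just a snippet from a larger scraper I use. I use some regular expressions because, for the small fragments I need, a regex is faster than a full parser implementation.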

Source

2016-03-05 12:50:07

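Thanks. Using the URL *http://www.embassy-worldwide.com*, how would you get that page's HTML so I can use it to scrape the page? –

'r.data' contains the raw dump of the HTTP response body. – 2016-03-05 12:55:19

Simplified the code to a single request. Remove the Cookie entry if you don't need it. – 2016-03-05 13:00:21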
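Once the raw bytes are available (r.data with urllib3, or response.read() with urllib), they can be passed to BeautifulSoup to walk the tags. The following is only a sketch: the real markup of embassy-worldwide.com is not shown in this thread, so SAMPLE_HTML, the "#countries a" selector, and the helper name extract_link_texts are all assumptions that would need adjusting after inspecting the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real site's markup
# is not shown in the thread and will almost certainly differ.
SAMPLE_HTML = b"""
<html><body>
  <ul id="countries">
    <li><a href="/country/france/">France</a></li>
    <li><a href="/country/japan/">Japan</a></li>
  </ul>
  <a href="/about/">About</a>
</body></html>
"""


def extract_link_texts(raw_html, css_selector="a"):
    """Return the stripped text of every element matching css_selector."""
    # "html.parser" is the parser bundled with Python, so no extra install.
    soup = BeautifulSoup(raw_html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(css_selector)]


# Restrict the search to links inside the (assumed) country list.
countries = extract_link_texts(SAMPLE_HTML, "#countries a")
print(countries)
```

With a live response, SAMPLE_HTML would be replaced by r.data, and the selector by whatever element actually wraps the country links on the page.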
