Python urllib.request does not get the same HTML as my browser

I am trying to fetch the HTML of http://groupon.cl/descuentos/santiago-centro with the following Python code:

import urllib.request
url = "http://groupon.cl/descuentos/santiago-centro"
request = urllib.request.Request(url, headers={'user-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')

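The HTML I get back is a page that asks for my location. If I open the same link manually in my browser (even in a freshly installed browser with no cookies), I go straight to the page with the discount promotions. It looks like some redirect is happening that urllib does not reproduce. I am sending a user-agent header to try to get the behavior of a typical browser, but I have had no luck.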

How can I get the same HTML that my browser gets?


2012-11-21 ajendrex

It looks the same here... – LtWorf

I think you can run this command:

wget -d http://groupon.cl/descuentos/santiago-centro

You will see wget print two HTTP requests and their responses, and save the page to a file.

- HTTP/1.1 302 Moved Temporarily 
- HTTP/1.1 200 OK

The content of that file is the HTML you want.
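To see the same redirect from Python rather than from wget, one option is to record each redirect that urllib follows. This is only a sketch: `RecordingRedirectHandler` and its `seen` list are hypothetical names introduced here, not part of the original answer.

```python
import urllib.request


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that remembers every redirect it is asked to follow."""

    def __init__(self):
        # Each entry is a (status_code, new_url) pair, in the order seen.
        self.seen = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Note the redirect, then let the standard handler build the new request.
        self.seen.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)
```

Passing an instance to `urllib.request.build_opener(handler)` and calling `opener.open(url)` should leave one `(status, new_url)` pair in `handler.seen` for every redirect that was followed.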

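The first response code is 302, so urllib.request.urlopen makes a second request. However, it does not send the cookie that the first response set, so the server cannot recognize the second request and you get a different page.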
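If the missing cookie is indeed the cause, one way to stay with urllib is to build an opener that keeps cookies between the first response and the redirected request. The following is a minimal sketch under that assumption; `build_cookie_opener` and `fetch_html` are hypothetical helper names, and the default user-agent string is a placeholder.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener(user_agent="Mozilla/5.0"):
    """Return (opener, jar); the opener stores cookies and replays them."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor saves Set-Cookie values from each response into the
    # jar and adds matching cookies to later requests, including redirects.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_html(url, encoding="utf-8"):
    """Fetch a URL through a cookie-aware opener and return the decoded body."""
    opener, _jar = build_cookie_opener()
    with opener.open(url) as response:
        return response.read().decode(encoding)
```

Calling `fetch_html("http://groupon.cl/descuentos/santiago-centro")` would then follow the 302 with the cookie attached. Whether that is enough to get the promotions page depends on what the server actually checks.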

The http.client module does not follow 301 or 302 HTTP responses by itself, so you can handle the redirect and the cookie manually:

import http.client

conn = http.client.HTTPConnection("groupon.cl")
# do the first request
conn.request("GET", "/descuentos/santiago-centro")
response = conn.getresponse()
print(response.status)  # 301 or 302
print(response.getheaders())  # includes Set-Cookie
response.read()  # finish reading before reusing the connection

# get the cookie
headers = ....
# do the second request

conn.request("GET", "/", headers=headers)
......
......
# get the response page
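The outline above leaves the cookie handling elided. A possible way to fill it in is sketched below. It assumes plain HTTP, follows at most one redirect, and copies only the name=value part of each Set-Cookie header; `cookie_header` and `fetch_with_manual_redirect` are hypothetical helper names.

```python
import http.client
from urllib.parse import urlsplit


def cookie_header(set_cookie_values):
    """Build a Cookie header value from raw Set-Cookie header values."""
    # Keep only the leading "name=value" pair of each header and drop
    # attributes such as Path, Expires, or HttpOnly.
    pairs = [value.split(";", 1)[0].strip() for value in set_cookie_values]
    return "; ".join(pairs)


def fetch_with_manual_redirect(host, path, encoding="utf-8"):
    """GET a page, following one 301/302 by hand and resending its cookies."""
    conn = http.client.HTTPConnection(host)
    try:
        conn.request("GET", path)
        first = conn.getresponse()
        body = first.read()
        if first.status not in (301, 302):
            return body.decode(encoding)  # no redirect, so this is the page
        location = first.getheader("Location", "/")
        cookies = first.headers.get_all("Set-Cookie") or []
    finally:
        conn.close()

    # The Location header may be an absolute URL or just a path.
    target = urlsplit(location)
    next_host = target.netloc or host
    next_path = target.path or "/"
    if target.query:
        next_path += "?" + target.query

    conn = http.client.HTTPConnection(next_host)
    try:
        headers = {"Cookie": cookie_header(cookies)} if cookies else {}
        conn.request("GET", next_path, headers=headers)
        return conn.getresponse().read().decode(encoding)
    finally:
        conn.close()
```

For the page in the question the call would be `fetch_with_manual_redirect("groupon.cl", "/descuentos/santiago-centro")`. A redirect to an https URL or a chain of several redirects would need more handling than this sketch provides.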


2012-12-10 14:08:48 pexeer
