urllib2.urlopen fails while urllib.urlopen works on the same URL

I am using urllib and urllib2 to scrape some data from a particular website.

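At the moment the urllib part of the code is mainly used to read and process data, while the urllib2 part is mainly used to read and store data.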

The external website went through some changes, and while the urllib part of the code continues to work, the urllib2 part simply keeled over.

So I did some checking and found that urllib2.urlopen(URL) always yields an empty string, while urllib.urlopen(URL) always works correctly.

I dug further and enabled debug logging for both the urllib and urllib2 modules:

>>> response2 =urllib2.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist') 
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxltd.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n' 
reply: 'HTTP/1.1 302 Moved Temporarily\r\n' 
header: Server: nginx/0.7.67 
header: Date: Thu, 28 Nov 2013 19:12:28 GMT 
header: Transfer-Encoding: chunked 
header: Connection: close 
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist 
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n' 
reply: 'HTTP/1.1 301 Moved Permanently\r\n' 
header: Server: Apache-Coyote/1.1 
header: Location: /home/new/attendancelist.jsp 
header: Content-Length: 0 
header: Date: Thu, 28 Nov 2013 19:12:26 GMT 
header: Connection: close 
send: 'GET /home/new/attendancelist.jsp HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n' 
reply: 'HTTP/1.1 200 OK\r\n' 
header: Server: Apache-Coyote/1.1 
header: Set-Cookie: JSESSIONID=F02B1F76CCCF6F41BE48951F6E1A6205; Path=/home 
header: Content-Type: text/html;charset=utf-8 
header: Content-Length: 0 
header: Date: Thu, 28 Nov 2013 19:12:26 GMT 
header: Connection: close

And...

>>> html3=urllib.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist') 
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxltd.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n' 
reply: 'HTTP/1.1 302 Moved Temporarily\r\n' 
header: Server: nginx/0.7.67 
header: Date: Thu, 28 Nov 2013 19:10:36 GMT 
header: Connection: close 
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist 
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n' 
reply: 'HTTP/1.1 301 Moved Permanently\r\n' 
header: Server: Apache-Coyote/1.1 
header: Location: /home/new/attendancelist.jsp 
header: Content-Length: 0 
header: Date: Thu, 28 Nov 2013 19:10:34 GMT 
header: Connection: close 
send: 'GET /home/new/attendancelist.jsp HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n' 
reply: 'HTTP/1.1 200 OK\r\n' 
header: Server: Apache-Coyote/1.1 
header: Set-Cookie: JSESSIONID=8CFB903B80C42CA3DA37EDF90D84FF99; Path=/home 
header: Content-Type: text/html;charset=utf-8 
header: Date: Thu, 28 Nov 2013 19:10:35 GMT 
header: Connection: close
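Traces like the two above come from the debug facility of the underlying HTTP handler. A minimal sketch of one way to switch it on for urllib2 follows; the `make_debug_opener` helper name is an assumption for illustration, and the import fallback is only there so the same sketch also runs on Python 3, where urllib2 became `urllib.request`.

```python
try:
    import urllib2 as request_mod          # Python 2, as used in the question
except ImportError:
    import urllib.request as request_mod   # Python 3 equivalent module


def make_debug_opener():
    """Return an opener whose HTTP handler prints send/reply/header lines."""
    # debuglevel=1 makes the underlying HTTP connection echo the traffic.
    handler = request_mod.HTTPHandler(debuglevel=1)
    return request_mod.build_opener(handler)


# Hypothetical helper name; nothing is fetched until opener.open() is called.
opener = make_debug_opener()
```

Calling `opener.open(url)` with the URL from the question should then print the request and response headers for each hop of the redirect chain.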

As can be seen, the urllib2 connection flow sends noticeably more request headers (one of which is the Connection header, with the value close).
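To make the comparison concrete, the first `send:` line of each trace can be parsed and diffed. This is only a sketch: the two request strings are copied from the traces above, while `parse_request` is a helper name invented here for illustration.

```python
def parse_request(raw):
    """Split a raw HTTP request into its request line and a header dict."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return lines[0], headers


# First request of each trace, copied from the debug output.
URLLIB2_SEND = ("GET /web/guest/attendancelist HTTP/1.1\r\n"
                "Accept-Encoding: identity\r\n"
                "Host: www.xxxxxxxxltd.com\r\n"
                "Connection: close\r\n"
                "User-Agent: Python-urllib/2.6\r\n\r\n")
URLLIB_SEND = ("GET /web/guest/attendancelist HTTP/1.0\r\n"
               "Host: www.xxxxxxxxltd.com\r\n"
               "User-Agent: Python-urllib/1.17\r\n\r\n")

line2, headers2 = parse_request(URLLIB2_SEND)
line1, headers1 = parse_request(URLLIB_SEND)

# Headers that only urllib2 sends, and shared headers whose values differ.
only_in_urllib2 = sorted(set(headers2) - set(headers1))
changed = sorted(name for name in set(headers1) & set(headers2)
                 if headers1[name] != headers2[name])
# The protocol version is the last token of the request line.
version1, version2 = line1.split()[-1], line2.split()[-1]

print("only in urllib2: %s" % only_in_urllib2)
print("different value: %s" % changed)
print("protocol: urllib=%s urllib2=%s" % (version1, version2))
```

On these two requests the diff reports Accept-Encoding and Connection as extra urllib2 headers, User-Agent as differing in value, and HTTP/1.0 (urllib) versus HTTP/1.1 (urllib2) as the protocol version, so the Connection header is not the only candidate.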

Can anyone help me find out why urllib2 fails to retrieve the data while the urllib module works fine?

I am sure it has something to do with the Connection header, but I would like some confirmation and an explanation of the reasoning.

Thanks.

Source

2013-11-28 Kris Ogirri

The only difference I see in the logs is the Accept-Encoding header. What content does urllib return? E.g., is it plain HTML or gzipped? – alko

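The real problem is that while urllib returns the actual content of the page (plain text, correctly fetched and formatted), the urllib2 response returns no data at all (this is confirmed by the 'Content-Length' value being set to 0 in the urllib2 response headers). –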

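I would suggest using curl to debug the headers sent by the two versions of urllib. With some trial and error you should be able to find the header that causes the problem and go from there.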
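The same trial-and-error idea can also be sketched in Python instead of curl, by sending a baseline request and then adding one suspect header at a time. This is an assumption-laden sketch, not the answerer's method: `header_variants`, `fetch` and `probe` are names invented here, and the header values are taken from the traces in the question.

```python
try:
    import httplib as http_client        # Python 2
except ImportError:
    import http.client as http_client    # Python 3


def header_variants(minimal, extras):
    """Yield (label, headers): the minimal set, then minimal plus one extra."""
    yield "baseline", dict(minimal)
    for name in sorted(extras):
        variant = dict(minimal)
        variant[name] = extras[name]
        yield "+" + name, variant


def fetch(host, path, headers, timeout=10):
    """Send one GET with exactly these headers (plus Host); return (status, body length)."""
    conn = http_client.HTTPConnection(host, timeout=timeout)
    try:
        # skip_accept_encoding stops the library from adding its own
        # "Accept-Encoding: identity" header behind our back.
        conn.putrequest("GET", path, skip_accept_encoding=True)
        for name, value in headers.items():
            conn.putheader(name, value)
        conn.endheaders()
        response = conn.getresponse()
        return response.status, len(response.read())
    finally:
        conn.close()


def probe(host, path, minimal, extras):
    """Try every variant against one URL and collect (label, status, length)."""
    return [(label,) + fetch(host, path, headers)
            for label, headers in header_variants(minimal, extras)]


# Header values copied from the traces; which one matters is the open question.
MINIMAL = {"User-Agent": "Python-urllib/1.17"}
EXTRAS = {"Accept-Encoding": "identity", "Connection": "close"}
variants = list(header_variants(MINIMAL, EXTRAS))
```

A call such as `probe("www.xxxxxxxxplc.com", "/home/new/attendancelist.jsp", MINIMAL, EXTRAS)` would then show which variant, if any, comes back with a body length of 0. Note that this low-level client does not follow redirects, so it has to be pointed at the final URL, and it always speaks HTTP/1.1, so it cannot reproduce the HTTP/1.0 request line that urllib sends.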

Source

2013-11-29 07:05:05 Phil

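Thanks for the information, I will give it a try. Do you have any links that could help me recreate the requests with curl? I am a bit unsure whether we need a command-line tool like curl ('wget' or something similar), or whether we can use a browser-based solution (e.g. 'Fiddler'). –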
