Downloading a .csv file from Python (with redirect)

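Let me start by saying that I know there are several topics discussing problems similar to mine, but for some reason the suggested solutions do not seem to work for me. I am also new to downloading files from the internet with scripts. Up to now I have mostly used Python as a Matlab replacement (using numpy/scipy), so if I make some stupid mistakes, please bear with me.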

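My goal: I want to download a lot of .csv files automatically with Python from an internet database (http://dna.korea.ac.kr/vhot/). I want to do this because downloading the 1000+ csv files I need by hand is too cumbersome. The database can only be accessed through a user interface, where you have to select several options from dropdown menus and only reach the links to the .csv files after a number of steps. I found that the URL you get after filling in the dropdown menus and pressing 'search' contains all the dropdown parameters. This means I can change those parameters in the URL directly instead of using the dropdown menus, which helps a lot.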

An example URL from this website is (let's call it url1): url1 = http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=
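The idea of changing the URL parameters instead of clicking through the dropdown menus could be sketched roughly as follows. The parameter names and values are copied from url1 above; the helper name `build_search_url` is made up for this illustration, and which parameters the site actually requires is not known from the post.

```python
from urllib.parse import urlencode

BASE_URL = "http://dna.korea.ac.kr/vhot/search.php"


def build_search_url(mirname, species="Human"):
    """Rebuild a url1-style search URL for one entry of the miRNA dropdown.

    Hypothetical helper: the keys and default values mirror url1 from the
    question; only mirname_drop (and optionally species) is varied.
    """
    params = [
        ("species", species),
        ("selector", "drop"),
        ("mirname", ""),
        ("mirname_drop", mirname),
        ("pita", "on"),
        ("set", "and"),
        ("miranda_th", "-5"),
        ("rh_th", "-10"),
        ("ts_th", "0"),
        ("mt_th", "7.3"),
        ("pt_th", "99999"),
        ("gene", ""),
    ]
    # urlencode keeps the order of the pairs and escapes values where needed.
    return BASE_URL + "?" + urlencode(params)


# Extend this list with the other names offered by the dropdown menu.
mirnames = ["hbv-miR-B2RC"]
search_urls = [build_search_url(name) for name in mirnames]
print(search_urls[0])
```

Looping over a list of names in this way produces one search URL per dropdown entry without touching the user interface.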

On the page that url1 loads, I can choose from 5 CSV files. One of them leads me to the following URL:

url2 = http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=&microt=&pita=on

However, this URL does not contain the CSV file directly but seems to be a "redirect" (a new term for me that I found by googling, so please correct me if I am wrong).

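One strange thing: I seem to have to load url1 in my browser before I can access url2 (I do not know whether it has to be on the same day or within the same hour; url2 did not work for me today although it did yesterday, and it only worked again after I visited url1...). If I do not visit url1 before url2, my browser shows "no results" instead of giving me my csv file. Does anyone know what is going on here?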

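However, my main problem is that I cannot save the csv files from Python. I have tried the packages urllib, urllib2 and requests, but I cannot get any of them to work. From what I understand, the requests package should take care of redirects, but I have not been able to make it work.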

The solutions from the following web pages do not seem to work for me (or I am messing something up):

stackoverflow.com/questions/7603044/how-to-download-a-file-returned-indirectly-from-html-form-submission-pyt

stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url

techniqal.com/blog/2008/07/31/python-file-read-write-with-urllib2/

Some of the things I tried include:

import urllib2 
import csv 
import sys 

url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&microt=&pita=' 

#1 
u = urllib2.urlopen(url) 
localFile = open('file.csv', 'w') 
localFile.write(u.read()) 
localFile.close() 

#2 
req = urllib2.Request(url) 
res = urllib2.urlopen(req) 
finalurl = res.geturl() 
pass 
# finalurl = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&microt=&pita=' 

#3 
import requests 
r = requests.get(url) 
r.content 
pass 
#r.content = "<script>location.replace('download_send.php?name=qgN9Th&type=targetscan');</script>"

#4 
import requests 
r = requests.get(url, 
allow_redirects=True, 
data={'download_open': 'Download', 'format_open': '.csv'}) 
print r.content 
# r.content = " 

#5 
import urllib 
test1 = urllib.urlretrieve(url, "test.csv") 
test2 = urllib.urlopen(url) 
pass

For #2, #3 and #4 the output is shown in the comment after the code. For #1 and #5 I just get a .csv file with </script>

Option #3 just gives me a new redirect, I think. Can this help me?

Can anyone help me with my problem?


2012-04-23 Rubbert

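The page does not send an HTTP redirect; the redirect is done via JavaScript instead. urllib and requests do not process JavaScript, so they cannot follow the redirect to the download URL. You have to extract the final download URL yourself and then open it using any method.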

You could extract the URL using the re module with a regex like r'location.replace\((.*?)\)'


2012-04-23 16:21:44 ch3ka

An explanation of r'location.replace\((.*?)\)' would be helpful – 2014-11-10 03:26:16

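'.*' matches any sequence of characters and '?' makes it non-greedy, so the group delimited by the inner parentheses matches whatever is inside the parentheses of the 'location.replace' call, which happens to be the URL the JavaScript redirects to. – ch3ka 2014-11-11 13:28:53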
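To make the explanation concrete, here is a small sketch that applies the pattern to the response body quoted under attempt #3 in the question. The second pattern is a variant added for this illustration, not part of the answer: it escapes the dot and keeps the quotes outside the capture group.

```python
import re

# Response body as quoted under attempt #3 in the question.
html = ("<script>location.replace("
        "'download_send.php?name=qgN9Th&type=targetscan');</script>")

# Pattern from the answer: the group captures everything between the
# parentheses of location.replace(...), including the surrounding quotes.
raw = re.search(r"location.replace\((.*?)\)", html).group(1)

# Variant (an assumption of this sketch): escape the dot so it matches a
# literal '.', and match the single quotes outside the group.
clean = re.search(r"location\.replace\('(.*?)'\)", html).group(1)

print(raw)
print(clean)
```

The first result still carries the single quotes around the target, while the second is the bare relative URL that can be joined onto the address of the page.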

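Based on the response from ch3ka about extracting the URL, I think I got it working. From the page source I get the JavaScript redirect, and from this redirect I can get the data.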

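import re
import requests
from requests.compat import urljoin  # available on both Python 2 and 3

# Fetch the page source (url is the download.php URL from the question);
# .text returns a string that the regex can search
redirect = requests.get(url).text

# Search for the JavaScript redirect in the page source
# --> based on the answer by ch3ka
m = re.search(r"location\.replace\('(.*?)'\)", redirect).group(1)

# The redirect target is relative, so join it onto the page URL,
# then use the resulting URL to get the data
new_url = urljoin(url, m)
data = requests.get(new_url).content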
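Putting the pieces together, a fuller version of this approach might look like the sketch below. It uses only the standard library (Python 3) so that it stays self-contained; the same flow could be written with a requests session. The function names are invented for this illustration. Visiting the search page first with a shared cookie jar is only a guess at why url1 has to be loaded before url2; the thread does not confirm that the site relies on cookies.

```python
import re
from http.cookiejar import CookieJar
from urllib.parse import urljoin
from urllib.request import HTTPCookieProcessor, build_opener

# Accepts single or double quotes around the redirect target.
REDIRECT_RE = re.compile(r"""location\.replace\(\s*['"](.*?)['"]\s*\)""")


def resolve_js_redirect(page_url, html):
    """Return the absolute URL a location.replace(...) call points to, or None."""
    match = REDIRECT_RE.search(html)
    if match is None:
        return None
    # The target is relative (e.g. download_send.php?...), so join it
    # onto the URL of the page that contained the script.
    return urljoin(page_url, match.group(1))


def make_fetcher():
    """Return fetch(url) -> bytes that keeps cookies between calls."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))

    def fetch(url):
        with opener.open(url, timeout=30) as response:
            return response.read()

    return fetch


def download_csv(search_url, download_url, out_path, fetch=None):
    """Visit search_url, follow the JavaScript redirect, save the file."""
    fetch = fetch or make_fetcher()
    # Assumption: the search page has to be requested first, as observed
    # in the browser; whether cookies are the reason is not confirmed.
    fetch(search_url)
    html = fetch(download_url).decode("utf-8", errors="replace")
    target = resolve_js_redirect(download_url, html)
    if target is None:
        raise ValueError("no location.replace(...) redirect found")
    with open(out_path, "wb") as handle:
        handle.write(fetch(target))
    return target
```

Calling download_csv(url1, url2, "file.csv") would then write the response of the redirect target to file.csv. Passing a custom fetch function makes it possible to try the logic without touching the network.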


2012-04-25 07:34:49 Rubbert
