2014-12-29 8 views
4

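Python - Download a file from an ASPX form

I want to automatically download some data from this website: http://www.casablanca-bourse.com/bourseweb/en/Negociation-History.aspx?Cat=24&IdLink=225

Using Python's urllib2, I can successfully get the HTML page that I would get by clicking the "Submit" button on this site.

But when I simulate clicking the "Download data" link, I get no output.

My code is: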

import urllib 
import urllib2 

uri = 'http://www.casablanca-bourse.com/bourseweb/en/Negociation-History.aspx?Cat=24&IdLink=225' 
headers = { 
    'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36', 
    'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' 
} 

formFields = (
    (r'TopControl1$ScriptManager1', r'HistoriqueNegociation1$UpdatePanel1|HistoriqueNegociation1$HistValeur1$LinkButton1'), 
    (r'__EVENTTARGET', r'HistoriqueNegociation1$HistValeur1$LinkButton1'), 
    (r'__EVENTARGUMENT', r''), 
    (r'__VIEWSTATE', r'/wEPDwUKMTcy/ ... +ZHYQBq1hB/BZ2BJyHdLM='), #just a small part because it's so long ! 
    (r'TopControl1$TxtRecherche', r''), 
    (r'TopControl1$txtValeur', r''), 
    (r'HistoriqueNegociation1$HistValeur1$DDValeur', r'9000 '), 
    (r'HistoriqueNegociation1$HistValeur1$historique', r'RBSearchDate'), 
    (r'HistoriqueNegociation1$HistValeur1$DateTimeControl1$TBCalendar', r'22/12/2014'), 
    (r'HistoriqueNegociation1$HistValeur1$DateTimeControl2$TBCalendar', r'28/12/2014'), 
    (r'HistoriqueNegociation1$HistValeur1$DDuree', r'6'), 
    (r'hiddenInputToUpdateATBuffer_CommonToolkitScripts', r'1') 
) 


encodedFields = urllib.urlencode(formFields) 

req = urllib2.Request(uri, encodedFields, headers) 
f = urllib2.urlopen(req) 

How can I get the same file that I would get by clicking the "Download data" link on the site?

Thanks

+1

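Because the ASP.NET form-specific values change every time you retrieve the page, you need to parse those values from the HTML you get back rather than hard-coding them. – alecxe

Answers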

0

First, I suggest you use the requests library instead of urllib. We also need BeautifulSoup to work with the HTML tags:

pip install requests 

pip install beautifulsoup4 

Then the code looks like this:

import requests 
from bs4 import BeautifulSoup 

session = requests.Session() 

payload = { 
    r'TopControl1$ScriptManager1': r'HistoriqueNegociation1$UpdatePanel1|HistoriqueNegociation1$HistValeur1$LinkButton1', 
    r'__EVENTTARGET': r'HistoriqueNegociation1$HistValeur1$LinkButton1', 
    r'__EVENTARGUMENT': r'', 
    r'TopControl1$TxtRecherche': r'', 
    r'TopControl1$txtValeur': r'', 
    r'HistoriqueNegociation1$HistValeur1$DDValeur': r'9000 ', 
    r'HistoriqueNegociation1$HistValeur1$historique': r'RBSearchDate', 
    r'HistoriqueNegociation1$HistValeur1$DateTimeControl1$TBCalendar': r'22/12/2014', 
    r'HistoriqueNegociation1$HistValeur1$DateTimeControl2$TBCalendar': r'28/12/2014', 
    r'HistoriqueNegociation1$HistValeur1$DDuree': r'6', 
    r'hiddenInputToUpdateATBuffer_CommonToolkitScripts': r'1' 
    } 


uri = 'http://www.casablanca-bourse.com/bourseweb/en/Negociation-History.aspx?Cat=24&IdLink=225' 
r = session.get(uri) 

# Find the __VIEWSTATE value; there is only one input tag with type="hidden"
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning
viewstate_tag = soup.find('input', attrs={"type" : "hidden"})
payload[viewstate_tag['name']] = viewstate_tag['value']

# Post the form back with the freshly extracted __VIEWSTATE
r = session.post(uri, data=payload)
print(r.text)  # contains html table with data

First we fetch the original page, extract the __VIEWSTATE value, and then use that value in the second request.
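If the page ever carries more than one hidden field (ASP.NET forms often add `__EVENTVALIDATION` next to `__VIEWSTATE`, though whether this site does is an assumption), it may be safer to copy every hidden input into the payload. Here is a minimal sketch of that extraction step. It uses the standard library's HTMLParser instead of BeautifulSoup so it runs without extra installs; `HiddenInputCollector`, `extract_hidden_fields` and the sample markup are hypothetical names and data, not taken from the site.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class HiddenInputCollector(HTMLParser):
    """Collects the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        is_hidden = (attrs.get('type') or '').lower() == 'hidden'
        if is_hidden and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical markup standing in for the real page (assumption, not site data)
sample_html = '''
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTcy" />
  <input type="hidden" name="__EVENTVALIDATION" value="abc123" />
  <input type="text" name="TopControl1$txtValeur" value="" />
</form>
'''
hidden = extract_hidden_fields(sample_html)
```

With the real page you would call `payload.update(extract_hidden_fields(r.text))` after the first `session.get(uri)` and before the `session.post`, so that whatever hidden values the server sent are echoed back unchanged.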

+0

Thanks sooooo much NorthCat, you saved my day :) – BilNash