2016-09-29

How to download a large file in Python 2

I'm trying to download a large file (about 1 GB) with the mechanize module, but so far without success. I've searched for similar threads, but only found ones where the file was publicly accessible and no login was required to get it. That's not my case: the file sits in a private section and I need to log in before downloading. Here's what I've done so far.

import mechanize 

g_form_id = "" 

def is_form_found(form1): 
    return "id" in form1.attrs and form1.attrs['id'] == g_form_id 

def select_form_with_id_using_br(br1, id1): 
    global g_form_id 
    g_form_id = id1 
    try: 
        br1.select_form(predicate=is_form_found) 
    except mechanize.FormNotFoundError: 
        print "form not found, id: " + g_form_id 
        exit() 

url_to_login = "https://example.com/" 
url_to_file = "https://example.com/download/files/filename=fname.exe" 
local_filename = "fname.exe" 

br = mechanize.Browser() 
br.set_handle_robots(False) # ignore robots 
br.set_handle_refresh(False) # can sometimes hang without this 
br.addheaders = [('User-agent', 'Firefox')] 

response = br.open(url_to_login) 
# Find login form 
select_form_with_id_using_br(br, 'login-form') 
# Fill in data 
br.form['email'] = '[email protected]' 
br.form['password'] = 'password' 
br.set_all_readonly(False) # allow everything to be written to 
br.submit() 

# Try to download file 
br.retrieve(url_to_file, local_filename) 

But I get an error once about 512 MB have been downloaded:

Traceback (most recent call last): 
    File "dl.py", line 34, in <module> 
    br.retrieve(url_to_file, local_filename) 
    File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve 
    block = fp.read(bs) 
    File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read 
    self.__cache.write(data) 
MemoryError: out of memory 

Do you have any idea how to solve this? Thanks.


Maybe try `requests`. – user3041764


Do you have to use mechanize? –


No, I don't care how it's done as long as the file gets downloaded. But the login part is the problem; if another module can handle that, I'm open to it. –

Answer


You can use bs4 with requests to get yourself logged in and then stream the content to disk. There are a few form fields you need to include, among them a _token_ field, which is required:

from bs4 import BeautifulSoup 
import requests 
from urlparse import urljoin 

data = {'email': '[email protected]', 'password': 'password'} 
base = "https://support.codasip.com" 
# url_to_file and local_filename as defined in the question 

with requests.Session() as s: 
    # update headers 
    s.headers.update({'User-agent': 'Firefox'}) 

    # use bs4 to parse the form fields 
    soup = BeautifulSoup(s.get(base).content, "html.parser") 
    form = soup.select_one("#frm-loginForm") 
    # works as it is a relative path. Not always the case. 
    action = form["action"] 

    # Get the rest of the fields; email and password are already set. 
    for inp in form.find_all("input", {"name": True, "value": True}): 
        name, value = inp["name"], inp["value"] 
        if name not in data: 
            data[name] = value 
    # login 
    s.post(urljoin(base, action), data=data) 
    # get the protected url, streaming it to disk chunk by chunk 
    with open(local_filename, "wb") as f: 
        for chk in s.get(url_to_file, stream=True).iter_content(1024): 
            f.write(chk) 
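To make the hidden-field loop concrete, here is the same extraction run against a toy form. The markup, the token name `_token_`, and its value are made up for illustration; only inputs carrying both a `name` and a `value` attribute are collected, so the visible email and password inputs are skipped and keep the values set by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical login form markup, for illustration only.
html = """
<form id="frm-loginForm" action="/login">
  <input type="hidden" name="_token_" value="abc123">
  <input type="text" name="email">
  <input type="password" name="password">
</form>
"""

data = {'email': '[email protected]', 'password': 'password'}
form = BeautifulSoup(html, "html.parser").select_one("#frm-loginForm")
# Same loop as in the answer: pick up every input that has both
# a name and a value, without clobbering fields already set.
for inp in form.find_all("input", {"name": True, "value": True}):
    if inp["name"] not in data:
        data[inp["name"]] = inp["value"]

print(data)  # the hidden _token_ field has been merged in
```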

Thanks for the answer, but it seems the login doesn't work, probably because the form entries aren't being found correctly. I'll try to figure it out. –


@MilanSkála, we actually need a token; I'll edit in a minute. –


@MilanSkála, it should work fine now. –