2016-09-29

How to download a large file in Python 2

I'm trying to download a large file (about 1 GB) with the mechanize module, but so far without success. I've searched for similar threads, but only found ones where the file was publicly accessible and no login was required to get it. That's not my case: the file sits in a private section and I need to log in before downloading. Here's what I've done so far.

import mechanize 

g_form_id = "" 

def is_form_found(form1): 
    return "id" in form1.attrs and form1.attrs['id'] == g_form_id 

def select_form_with_id_using_br(br1, id1): 
    global g_form_id 
    g_form_id = id1 
    try: 
        br1.select_form(predicate=is_form_found) 
    except mechanize.FormNotFoundError: 
        print "form not found, id: " + g_form_id 
        exit() 

url_to_login = "https://example.com/" 
url_to_file = "https://example.com/download/files/filename=fname.exe" 
local_filename = "fname.exe" 

br = mechanize.Browser() 
br.set_handle_robots(False) # ignore robots 
br.set_handle_refresh(False) # can sometimes hang without this 
br.addheaders = [('User-agent', 'Firefox')] 

response = br.open(url_to_login) 
# Find login form 
select_form_with_id_using_br(br, 'login-form') 
# Fill in data 
br.form['email'] = '[email protected]' 
br.form['password'] = 'password' 
br.set_all_readonly(False) # allow everything to be written to 
br.submit() 

# Try to download file 
br.retrieve(url_to_file, local_filename) 

But I get an error once about 512 MB have been downloaded:

Traceback (most recent call last): 
    File "dl.py", line 34, in <module> 
    br.retrieve(url_to_file, local_filename) 
    File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve 
    block = fp.read(bs) 
    File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read 
    self.__cache.write(data) 
MemoryError: out of memory 

Do you have any idea how to solve this? Thanks.


Maybe try `requests`. – user3041764


Do you have to use mechanize? –


No, I don't care how it's done as long as the file gets downloaded. But the login part is the problem; if another module can handle that, I'm open to it. –

Answer


You can use bs4 with requests to get yourself logged in and then stream the content to disk. There are a few form fields you need to include, among them a _token_ field, which is required:

from bs4 import BeautifulSoup 
import requests 
from urlparse import urljoin 

data = {'email': '[email protected]', 'password': 'password'} 
base = "https://support.codasip.com" 
# url_to_file and local_filename as defined in the question 

with requests.Session() as s: 
    # update headers 
    s.headers.update({'User-agent': 'Firefox'}) 

    # use bs4 to parse the form fields 
    soup = BeautifulSoup(s.get(base).content, "html.parser") 
    form = soup.select_one("#frm-loginForm") 
    # works as it is a relative path. Not always the case. 
    action = form["action"] 

    # Get the rest of the fields; email and password are already set. 
    for inp in form.find_all("input", {"name": True, "value": True}): 
        name, value = inp["name"], inp["value"] 
        if name not in data: 
            data[name] = value 
    # login 
    s.post(urljoin(base, action), data=data) 
    # get the protected url, streaming it to disk chunk by chunk 
    with open(local_filename, "wb") as f: 
        for chk in s.get(url_to_file, stream=True).iter_content(1024): 
            f.write(chk) 
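To make the hidden-field loop concrete, here is the same extraction run against a toy form. The markup, the token name `_token_`, and its value are made up for illustration; only inputs carrying both a `name` and a `value` attribute are collected, so the visible email and password inputs are skipped and keep the values set by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical login form markup, for illustration only.
html = """
<form id="frm-loginForm" action="/login">
  <input type="hidden" name="_token_" value="abc123">
  <input type="text" name="email">
  <input type="password" name="password">
</form>
"""

data = {'email': '[email protected]', 'password': 'password'}
form = BeautifulSoup(html, "html.parser").select_one("#frm-loginForm")
# Same loop as in the answer: pick up every input that has both
# a name and a value, without clobbering fields already set.
for inp in form.find_all("input", {"name": True, "value": True}):
    if inp["name"] not in data:
        data[inp["name"]] = inp["value"]

print(data)  # the hidden _token_ field has been merged in
```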

Thanks for the answer, but it seems the login doesn't work, probably because the form entries aren't being found correctly. I'll try to figure it out. –


@MilanSkála, we actually need a token; I'll edit in a minute. –


@MilanSkála, it should work fine now. –