2013-04-18

I am using Python 3.3.1. I created a function called download_file() that downloads files and saves them to disk. Why does it not work properly for text files?

#!/usr/bin/python3 
# -*- coding: utf8 -*- 

import datetime 
import os 
import urllib.error 
import urllib.request 


def download_file(*urls, download_location=os.getcwd(), debugging=False): 
    """Downloads the files provided as multiple url arguments. 

    Provide the url for files to be downloaded as strings. Separate the 
    files to be downloaded by a comma. 

    The function downloads the files and saves them in the folder
    provided as the keyword argument download_location. If
    download_location is not provided, the files are saved in the
    current working directory. The folder for download_location is
    created if it doesn't already exist. Do not worry about a trailing
    slash at the end of download_location. The code takes care of
    it for you.

    If the download encounters an error it would alert about it and 
    provide the information about the Error Code and Error Reason (if 
    received from the server). 

    Normal Usage: 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php') 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         download_location='/home/aditya/Download/test') 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         download_location='/home/aditya/Download/test/') 

    In Debug Mode, files are not downloaded, nor is there any attempt
    to establish a connection with the server. It just prints out the
    filename and the url that would have been downloaded in Normal
    Mode.

    By Default, Debug Mode is inactive. In order to activate it, we 
    need to supply a keyword-argument as 'debugging=True', like: 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         debugging=True) 
    >>> download_file('http://localhost/index.html', 
         'http://localhost/info.php', 
         download_location='/home/aditya/Download/test', 
         debugging=True) 

    """ 
    # Append a trailing slash at the end of download_location if not 
    # already present 
    if download_location[-1] != '/':
        download_location = download_location + '/'

    # Create the folder for download_location if not already present 
    os.makedirs(download_location, exist_ok=True) 

    # Other variables 
    time_format = '%Y-%b-%d %H:%M:%S' # '2000-Jan-01 22:10:00' 

    # "Request Headers" information for the file to be downloaded 
    accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' 
    accept_encoding = 'gzip, deflate' 
    accept_language = 'en-US,en;q=0.5' 
    connection = 'keep-alive' 
    # Adjacent string literals are joined, so no stray indentation
    # ends up inside the User-Agent value.
    user_agent = ('Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) '
                  'Gecko/20100101 Firefox/20.0')
    headers = {'Accept': accept,
               'Accept-Encoding': accept_encoding,
               'Accept-Language': accept_language,
               'Connection': connection,
               'User-Agent': user_agent,
               }

    # Loop through all the files to be downloaded
    for url in urls:
        filename = os.path.basename(url)
        if not debugging:
            try:
                request_sent = urllib.request.Request(url, None, headers)
                response_received = urllib.request.urlopen(request_sent)
            except urllib.error.URLError as error_encountered:
                print(datetime.datetime.now().strftime(time_format),
                      ':', filename, '- The file could not be downloaded.')
                if hasattr(error_encountered, 'code'):
                    print(' ' * 22, 'Error Code -', error_encountered.code)
                if hasattr(error_encountered, 'reason'):
                    print(' ' * 22, 'Reason -', error_encountered.reason)
            else:
                read_response = response_received.read()
                output_file = download_location + filename
                with open(output_file, 'wb') as downloaded_file:
                    downloaded_file.write(read_response)
                print(datetime.datetime.now().strftime(time_format),
                      ':', filename, '- Downloaded successfully.')
        else:
            print(datetime.datetime.now().strftime(time_format),
                  ': Debugging :', filename, 'would be downloaded from :\n',
                  ' ' * 21, url)

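This function works fine for downloading PDF files, images, and other formats, but it has trouble with text documents such as HTML files. I suspect the problem has something to do with this line near the end: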

with open(output_file, 'wb') as downloaded_file: 

So I tried opening the file in wt mode. I also tried plain w mode. Neither solves the problem.

Another possible cause was the source encoding, so I also included the second line:

# -*- coding: utf8 -*- 

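However, this still does not work. What could be the problem, and how can I make the function work for both text and binary files?

Example of what does not work: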

>>> download_file("http://docs.python.org/3/tutorial/index.html")

When I open it in Gedit, it shows up like this:

[screenshot: in gedit]

Similarly, when opened in Firefox:

[screenshot: in firefox]

+1

What exactly is the problem/error? –

+0

@StephaneRolland: It doesn't give any error. But when I open the document in a text editor, it reports a problem with the encoding. I will upload images in a moment. – Aditya

+0

Which text editor? –

Answers

2

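The file you are downloading was sent with gzip encoding. You can see this if you run zcat index.html: the decompressed output displays correctly. In your code, you probably need to add something like: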

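# Requires `import zlib` at the top of the module.
if response_received.headers.get('Content-Encoding') == 'gzip':
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    read_response = zlib.decompress(read_response, 16 + zlib.MAX_WBITS)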
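To see why the saved file looks garbled and why the check above helps, here is a minimal sketch that runs without any network access. The helper name decode_body and the sample HTML are made up for illustration, and the deflate branch is an extra assumption on my part, since the question's headers also advertise deflate.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Return the decoded bytes of a response body.

    Hypothetical helper. content_encoding is the value of the
    Content-Encoding response header, or None if it is absent.
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == 'deflate':
        try:
            # Standard form: a zlib-wrapped deflate stream.
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unknown) encoding: hand the bytes back untouched.
    return body


# Made-up sample data standing in for a downloaded page.
original = b'<html><body>Hello</body></html>'
compressed = gzip.compress(original)

# Gzip data starts with the magic bytes 1f 8b. Writing these bytes
# straight to index.html is what confuses a text editor.
print(compressed[:2])
print(decode_body(compressed, 'gzip') == original)
```

In the question's function, the equivalent call would be decode_body(read_response, response_received.headers.get('Content-Encoding')) just before the file is written. Keep opening the file in 'wb' mode either way, since the decoded result is still bytes.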

Edit:

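OK, I can't say why it works on Windows (unfortunately I don't have a Windows machine to test it on), but if you post a dump of the response (that is, the response object converted to a string), it might provide some insight. Presumably the server chose not to send the content gzip-encoded in that case, but given that the code is quite explicit about its headers, I'm not sure what would be different.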
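One way to produce such a dump is sketched below. describe_response is a hypothetical helper, not something from the question, and the sample header and body values are stand-ins so the snippet runs offline. Comparing its output on Linux and on Windows should show whether the server really answered differently.

```python
def describe_response(headers, body):
    """Summarise what the server sent (hypothetical diagnostic helper).

    headers: any object with a .get() method, for example
             response_received.headers from the question's code.
    body:    the raw bytes returned by response_received.read().
    """
    return {
        'Content-Encoding': headers.get('Content-Encoding'),
        'Content-Type': headers.get('Content-Type'),
        # Gzip streams always begin with the bytes 1f 8b.
        'looks_gzipped': body[:2] == b'\x1f\x8b',
        'first_bytes': body[:16],
    }


# Stand-in values. In the real function you would pass
# response_received.headers and read_response instead.
sample_headers = {'Content-Encoding': 'gzip', 'Content-Type': 'text/html'}
sample_body = b'\x1f\x8b\x08\x00' + b'\x00' * 12
print(describe_response(sample_headers, sample_body))
```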

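It is worth mentioning that your headers explicitly state that gzip and deflate are acceptable (see accept_encoding). If you remove that header, you shouldn't need to worry about decompressing the response at all.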
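As a rough sketch of that second option, the headers below are copied from the question with only the Accept-Encoding entry left out. Building the Request object does not contact the server, so this runs offline. My understanding is that when no Accept-Encoding header is supplied, the standard library asks for an uncompressed ("identity") response, but treat that as an assumption and still check Content-Encoding if you want to be safe.

```python
import urllib.request

# Same headers as in the question, minus 'Accept-Encoding'.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'User-Agent': ('Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) '
                   'Gecko/20100101 Firefox/20.0'),
}

url = 'http://docs.python.org/3/tutorial/index.html'
request_sent = urllib.request.Request(url, None, headers)

# Request normalises header capitalisation, so compare in lower case.
sent_names = [name.lower() for name, _ in request_sent.header_items()]
print('accept-encoding' in sent_names)
```

Passing request_sent to urllib.request.urlopen() then works exactly as in the original function.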

+0

How do you explain that it works perfectly on a computer running Windows 7? –

+0

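This works. However, I would still like an explanation of why it works on Windows. In the meantime, I will try other permutations and combinations, such as changing the headers. – Aditya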