Python: urlretrieve PDF downloading

I am using urllib's urlretrieve() function in Python to try to grab some PDFs from websites. It has (at least for me) stopped working and is downloading damaged data (15 KB instead of 164 KB).

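I've tested this with several PDFs, all without success (e.g. random.pdf). I can't seem to get it to work, and I need to be able to download PDFs for the project I'm working on.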

Here is an example of the kind of code I use to download PDFs (and extract their text with pdftotext.exe):

def get_html(url):  # gets html of page from Internet
    import os
    import urllib2
    import urllib
    from subprocess import call
    f_name = url.split('/')[-2]  # get file name (url must end with '/')
    try:
        if f_name.split('.')[-1] == 'pdf':  # file type
            urllib.urlretrieve(url, os.getcwd() + '\\' + f_name)
            call([os.getcwd() + '\\pdftotext.exe', os.getcwd() + '\\' + f_name])  # use xpdf to output .txt file
            return open(os.getcwd() + '\\' + f_name.split('.')[0] + '.txt').read()
        else:
            return urllib2.urlopen(url).read()
    except:
        print 'bad link: ' + url
        return ""

I am a novice programmer, so any input would be great! Thanks

2013-02-03 hisroar

I would suggest trying requests. It is a really nice library that hides all of the implementation behind a simple API.

>>> import requests 
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf") 
>>> len(req.content) 
167633 
>>> req.headers 
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}

By the way, the reason you are only getting a 15 KB download is that your URL is wrong. It should be

http://www.mathworks.com/moler/random.pdf

but you are GETting

http://www.mathworks.com/moler/random.pdf/ 

>>> import requests 
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/") 
>>> len(c.content) 
14390
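The trailing slash is a side effect of how the question's code derives the file name: `url.split('/')[-2]` only returns the file name when the URL ends with '/'. A minimal sketch of a more forgiving approach, assuming Python 3 and only the standard library. The helper names `strip_trailing_slash` and `pdf_filename_from_url` are hypothetical and not part of urllib or requests:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def strip_trailing_slash(url):
    """Return url with trailing slashes removed from its path component."""
    parts = urlsplit(url)
    # Only the path is touched; scheme, host, query and fragment are kept.
    return parts._replace(path=parts.path.rstrip("/")).geturl()


def pdf_filename_from_url(url):
    """Return the last path component, whether or not url ends with '/'."""
    path = urlsplit(url).path.rstrip("/")
    # Decode percent-escapes such as %20 so the local file name is readable.
    return unquote(posixpath.basename(path))
```

With these, the download can request `strip_trailing_slash(url)` and save to `pdf_filename_from_url(url)`, so the same code works for URLs written with or without the final slash.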

2013-02-03 05:54:32 sberry

Wow, that seems really weird. Thanks for telling me about requests. – hisroar

To write the file to disk:

# Open in binary mode ("wb"): req.content is raw bytes, not text
with open("out.pdf", "wb") as myfile:
    myfile.write(req.content)
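PDF data is binary, so the output file needs to be opened in binary mode. Text mode can alter the bytes on Windows and fails outright for bytes in Python 3. A minimal sketch, assuming Python 3: `save_pdf` is a hypothetical helper, and the check relies on PDF files normally starting with the `%PDF-` marker:

```python
def save_pdf(content, path):
    """Write raw bytes to path, refusing data that does not look like a PDF."""
    # A wrong URL often yields an HTML page instead of the document,
    # so check the leading marker before writing anything.
    if not content.startswith(b"%PDF-"):
        raise ValueError("data does not start with the %PDF- marker")
    with open(path, "wb") as fh:  # "wb": write bytes unchanged
        fh.write(content)
    return len(content)
```

For the session above this would be called as `save_pdf(req.content, "out.pdf")`. Raising an error on unexpected data is one possible design choice; it makes a wrong URL fail early instead of leaving an unreadable file behind.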

2015-06-27 19:08:45 user1767754

I tried to do this, but what I get is an unreadable .pdf. Any ideas? –

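Maybe it's a bit late, but you could try this: just write the content to a new file and read it using textract, since doing it without textract gave me unwanted text containing '#$'.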

import requests 
import textract 
url = "The url which downloads the file" 
response = requests.get(url) 
with open('./document.pdf', 'wb') as fw: 
    fw.write(response.content) 
text = textract.process("./document.pdf") 
print('Result: ', text)

2017-05-30 10:21:19 Arjunsingh
