Using BeautifulSoup to download all .zip files from Google Patent Search


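What I'm trying to do is use BeautifulSoup to download every zip file from the Google Patents archive. Below is the code I've written so far, but I'm having trouble getting the files to download into a directory on my desktop. Any help would be greatly appreciated.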

from bs4 import BeautifulSoup 
import urllib2 
import re 
import pandas as pd 

url = 'http://www.google.com/googlebooks/uspto-patents-grants.html' 

site = urllib2.urlopen(url) 
html = site.read() 
soup = BeautifulSoup(html) 
soup.prettify() 

path = open('/Users/username/Desktop/', "wb") 

for name in soup.findAll('a', href=True): 
    print name['href'] 
    linkpath = name['href'] 
    rq = urllib2.request(linkpath) 
    res = urllib2.urlopen(rq) 
    path.write(res.read()) 

The result I expect is that all the zip files are downloaded to a specific directory. Instead, I get the following error:

> AttributeError                            Traceback (most recent call last)
> <ipython-input-13-874f34e07473> in <module>()
>      17     print name['href']
>      18     linkpath = name['href']
> ---> 19     rq = urllib2.request(namep)
>      20     res = urllib2.urlopen(rq)
>      21     path.write(res.read())
>
> AttributeError: 'module' object has no attribute 'request'


2015-04-23 icomefromchaos

What trouble are you running into? What is the expected result, and what happens instead?

It should download all the zip files, but I get this error instead: AttributeError: 'module' object has no attribute 'request', raised at line 19, rq = urllib2.request(namep). – icomefromchaos

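Besides using a request attribute that does not exist in urllib2, you are not writing to a file correctly: you can't just open a directory, you have to open each output file individually.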
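To illustrate the point about opening each file individually, here is a minimal Python 3 sketch (the thread's own code is Python 2). The helper name `output_path_for`, the example URL, and the temporary directory are assumptions made up for this illustration; nothing is downloaded.

```python
import os
import tempfile
from urllib.parse import urlsplit


def output_path_for(url, out_dir):
    """Build a per-file path in out_dir, named after the last URL segment."""
    filename = os.path.basename(urlsplit(url).path)
    return os.path.join(out_dir, filename)


out_dir = tempfile.mkdtemp()  # stand-in for a real download directory
example_url = "http://example.com/patents/ipg150106.zip"  # hypothetical URL

# Opening the directory itself for writing fails with an OSError subclass.
directory_error = None
try:
    open(out_dir, "wb")
except OSError as exc:
    directory_error = exc

# Opening one file inside the directory works.
target = output_path_for(example_url, out_dir)
with open(target, "wb") as fd:
    fd.write(b"placeholder bytes")  # a real download would write the response body
```

In a download loop, the same helper would be called once per link so that every archive gets its own file handle.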

Also, the 'requests' package has a much nicer interface than urllib2. I recommend installing it.

Note that, as of today at least, the first .zip is 5.7 GB, so streaming to a file is essential.

Really, you want something more like this:

# Python 2, like the code in the question
from bs4 import BeautifulSoup  # same package the question imports
import requests

# point to output directory
outpath = 'D:/patent_zips/'
url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
mbyte = 1024 * 1024

print 'Reading: ', url
html = requests.get(url).text
soup = BeautifulSoup(html)

print 'Processing: ', url
for name in soup.findAll('a', href=True):
    zipurl = name['href']
    if zipurl.endswith('.zip'):
        # one output file per archive, named after the last URL segment
        outfname = outpath + zipurl.split('/')[-1]
        r = requests.get(zipurl, stream=True)
        if r.status_code == requests.codes.ok:
            fsize = int(r.headers['content-length'])
            print 'Downloading %s (%sMb)' % (outfname, fsize / mbyte)
            # the with block closes the file, so no explicit close() is needed
            with open(outfname, 'wb') as fd:
                for chunk in r.iter_content(chunk_size=1024):  # chunk size can be larger
                    if chunk:  # skip empty keep-alive chunks
                        fd.write(chunk)
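The chunked write loop above can also be pulled out into a small helper. The following is a Python 3 sketch under stated assumptions: `save_chunks` is a hypothetical name, and the demo feeds it an in-memory list of byte strings instead of a live response, so it runs without network access.

```python
import os
import tempfile


def save_chunks(chunks, out_path):
    """Write an iterable of byte chunks to out_path and return the byte count."""
    written = 0
    # The with block closes the file on exit, so no explicit close() is needed.
    with open(out_path, "wb") as fd:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fd.write(chunk)
                written += len(chunk)
    return written


# Offline demo: made-up chunks standing in for a streamed response body.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
demo_chunks = [b"PK\x03\x04", b"", b"payload"]
total = save_chunks(demo_chunks, demo_path)
```

With requests, the iterable would be `r.iter_content(chunk_size=...)` on a response obtained with `stream=True`, so only one chunk is held in memory at a time regardless of the archive size.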


2015-04-30 18:18:09 JohnH

JohnH ... thank you! This is exactly what I was looking for. – icomefromchaos

Here is your problem:

rq = urllib2.request(linkpath)

urllib2 is a module, and it has no request entity/attribute.

I do see a Request class in urllib2, but I'm not sure that's what you actually intended to use...
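If the Request class is indeed what was intended, the difference is only the capital R: in Python 2 it is `urllib2.Request`, and in Python 3, where urllib2 was split up, the class lives at `urllib.request.Request`. A minimal Python 3 sketch that builds the request object without sending it:

```python
from urllib.request import Request

url = "http://www.google.com/googlebooks/uspto-patents-grants.html"

# Constructing a Request does not touch the network;
# it only describes what would be sent.
req = Request(url)
method = req.get_method()  # "GET" when no data is attached
full_url = req.full_url
```

Passing `req` to `urlopen` would then perform the actual request. For a plain GET like this one, handing the URL string straight to `urlopen` works as well, so the Request object is optional here.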


2015-04-24 00:48:10
