Scraping: downloading files from a URL

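I have tried several approaches, such as:

download.file
read.table
GET

but none of them worked. I am not asking for code, just for any hints or ideas on how to handle this situation.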

2014-01-08 agstudy

In Python, the usual approach is to use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/).

Why doesn't download.file work? It works for me. – Spacedman

@Spacedman Can you show me how? Maybe I am missing something. – agstudy

A Python version using BeautifulSoup:

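try:
    # Python 3.x
    from urllib.request import urlopen, urlretrieve
    from urllib.parse import quote, urljoin
except ImportError:
    # Python 2.x
    from urllib import urlopen, urlretrieve, quote
    from urlparse import urljoin

from bs4 import BeautifulSoup

url = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    # Skip anchors without an href and JavaScript pseudo-links.
    if not href or href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    # Percent-encode spaces in the path, then resolve it against the page URL.
    # quote() also encodes ':', so this assumes the hrefs are site-relative paths.
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except Exception as exc:
        print('failed to download %s: %s' % (href, exc))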
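
The `urljoin(url, quote(href))` step in the code above can be checked offline. The following is only a small sketch: the `href` value is hypothetical, modelled on the file URLs quoted elsewhere in this thread, and it assumes the page uses site-relative links containing spaces.

```python
from urllib.parse import quote, urljoin

page_url = "http://oilandgas.ky.gov/Pages/ProductionReports.aspx"

# Hypothetical href (an assumption): a site-relative path with spaces,
# as it might appear in an <a> tag on the page.
href = "/Production Reports Library/2006 - gas Production.xls"

# The local file name is whatever follows the last slash.
filename = href.rsplit("/", 1)[-1]

# quote() percent-encodes the spaces; urljoin() resolves the path
# against the scheme and host of the page URL.
absolute = urljoin(page_url, quote(href))

print(filename)  # 2006 - gas Production.xls
print(absolute)  # http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls
```

With its default arguments, `quote()` also encodes the colon, so an absolute href such as `http://example.com/a b` would come out as `http%3A//example.com/a%20b`. If the page contained absolute links, something like `quote(href, safe=":/")` would probably be needed instead.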

2014-01-08 04:39:57 falsetru

Thanks for this solution. 'BeautifulSoup' really is beautiful. – agstudy

This works for me:

getIt <- function(what, when) {
    # Build the report URL; %20 is a percent-encoded space.
    url <- paste0("http://oilandgas.ky.gov/Production%20Reports%20Library/",
                  when, "%20-%20", what,
                  "%20Production.xls")
    destfile <- paste0("/tmp/", what, when, ".xls")
    download.file(url, destfile)
}

For example:

> getIt("gas",2006) 
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls' 
Content type 'application/vnd.ms-excel' length 3490304 bytes (3.3 Mb) 
opened URL 
================================================== 
downloaded 3.3 Mb

Except for the first one:

> getIt("oil",2010) 
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls' 
Error in download.file(url, destfile) : 
    cannot open URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls' 
In addition: Warning message: 
In download.file(url, destfile) : 
    cannot open: HTTP status was '404 NOT FOUND'

Although I can get the 2010 gas data:

> getIt("gas",2010) 
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20gas%20Production.xls' 
Content type 'application/vnd.ms-excel' length 4177408 bytes (4.0 Mb) 
opened URL 
================================================== 
downloaded 4.0 Mb

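So it looks like they have changed the system for this one link. You can get that data by following the link on the page and then looking for the download link in the cruddy SharePoint HTML.

And that is why we hate SharePoint, kiddies.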

2014-01-08 11:34:59 Spacedman

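+1 I hate SharePoint too, but it is everywhere :) You can see my attempt; it even works with your first file. Sorry again for not showing my attempt earlier (technical reasons). – agstudy

They switched to **xlsx** instead of **xls** for the first one. – Spacedman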
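
One way to cope with the extension change is to try both extensions in turn. The following is only a rough Python sketch, not the code used in this thread: `candidate_urls` and `get_it` are hypothetical names, and it assumes the `.xlsx` file follows the same naming pattern as the `.xls` ones, which the thread does not confirm.

```python
import os
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlretrieve

# Library folder taken from the URLs quoted in this thread (unencoded form).
BASE = "http://oilandgas.ky.gov/Production Reports Library/"


def candidate_urls(what, when, extensions=(".xls", ".xlsx")):
    """Return the percent-encoded URLs to try, in order.

    Assumption: only the extension differs between the old and new files.
    """
    name = "{} - {} Production".format(when, what)
    return [quote(BASE + name + ext, safe=":/") for ext in extensions]


def get_it(what, when, dest_dir="."):
    """Download the first candidate that exists and return its local path.

    Returns None if every candidate answers 404; other HTTP errors propagate.
    """
    for url in candidate_urls(what, when):
        extension = url.rsplit(".", 1)[-1]
        path = os.path.join(dest_dir, "{}{}.{}".format(what, when, extension))
        try:
            urlretrieve(url, path)
            return path
        except HTTPError as err:
            if err.code != 404:
                raise
    return None
```

Calling `get_it("oil", 2010)` would then first request the `.xls` URL and fall back to `.xlsx` on a 404. Whether either URL still resolves depends on the site.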
