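Scraping: download files from a URL
I want to download the files from a page automatically.
I have tried several approaches, such as:
download.file
read.table
GET
but none of them worked. I am not asking for code, just for any hints or ideas on how to handle this situation.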
A Python version using BeautifulSoup:
    try:
        # Python 3.x
        from urllib.request import urlopen, urlretrieve, quote
        from urllib.parse import urljoin
    except ImportError:
        # Python 2.x
        from urllib import urlopen, urlretrieve, quote
        from urlparse import urljoin

    from bs4 import BeautifulSoup

    url = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'
    u = urlopen(url)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()

    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.select('div[webpartid] a'):
        href = link.get('href')
        # Skip anchors without an href and javascript: pseudo-links.
        if not href or href.startswith('javascript:'):
            continue
        # Use the last path segment as the local file name.
        filename = href.rsplit('/', 1)[-1]
        # Percent-encode the href, then resolve it against the page URL.
        href = urljoin(url, quote(href))
        try:
            urlretrieve(href, filename)
        except Exception:
            print('failed to download: ' + href)
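One detail in the script above is worth a closer look: `quote(href)` is applied before `urljoin`, which suits relative hrefs containing raw spaces but would re-encode an href that is already percent-encoded or absolute. The sketch below shows one possible alternative for Python 3. The helper name `resolve_link` is hypothetical and not part of the original answer; it resolves the href first and then escapes only the path, leaving existing `%XX` escapes alone.

```python
from urllib.parse import quote, unquote, urljoin, urlsplit, urlunsplit


def resolve_link(base_url, href):
    """Return (absolute_url, local_filename) for an href found on base_url.

    Hypothetical helper: resolves first, then escapes only the path.
    """
    absolute = urljoin(base_url, href)
    parts = urlsplit(absolute)
    # "%" is marked safe so escapes that are already present are not doubled.
    path = quote(parts.path, safe="/%")
    # Decode the last path segment to get a readable local file name.
    filename = unquote(path.rsplit("/", 1)[-1])
    url = urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
    return url, filename
```

With this, an href written with spaces and the same href written with `%20` both resolve to the same download URL and the same local file name.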
Thanks for this solution. 'BeautifulSoup' really is beautiful. – agstudy
This works for me:
    getIt = function(what, when) {
      url = paste0("http://oilandgas.ky.gov/Production%20Reports%20Library/",
                   when, "%20-%20", what,
                   "%20Production.xls")
      destfile = paste0("/tmp/", what, when, ".xls")
      download.file(url, destfile)
    }
For example:
    > getIt("gas",2006)
    trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls'
    Content type 'application/vnd.ms-excel' length 3490304 bytes (3.3 Mb)
    opened URL
    ==================================================
    downloaded 3.3 Mb
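For readers following the Python answer instead, the same URL pattern can be built like this. It is only a sketch: `report_url` and `BASE` are hypothetical names, and the pattern is taken from the R function above, not from any documented API of the site.

```python
from urllib.parse import quote

# Base folder as it appears in the R function above (already percent-encoded).
BASE = "http://oilandgas.ky.gov/Production%20Reports%20Library/"


def report_url(what, when):
    """Build the report URL for a product ("gas" or "oil") and a year."""
    name = "{} - {} Production.xls".format(when, what)
    # quote() turns the spaces in the file name into %20.
    return BASE + quote(name)
```

`report_url("gas", 2006)` yields the same URL that `getIt("gas", 2006)` requests in the transcript above.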
Except for the first one:
    > getIt("oil",2010)
    trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
    Error in download.file(url, destfile) :
      cannot open URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
    In addition: Warning message:
    In download.file(url, destfile) :
      cannot open: HTTP status was '404 NOT FOUND'
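When many reports are fetched in one run, a single 404 like this one need not abort the whole batch. The following Python sketch collects the failures instead. `download_reports` is a hypothetical helper, and the `fetch` parameter exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.request import urlretrieve


def download_reports(jobs, fetch=urlretrieve):
    """Try each (url, destfile) pair; return [(url, status)] for HTTP failures."""
    missing = []
    for url, destfile in jobs:
        try:
            fetch(url, destfile)
        except HTTPError as err:
            # Record the URL and status code (e.g. 404) and keep going.
            missing.append((url, err.code))
    return missing
```

The returned list shows which links do not follow the expected pattern and have to be looked up by hand.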
I can get the 2010 gas data, though:
    > getIt("gas",2010)
    trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20gas%20Production.xls'
    Content type 'application/vnd.ms-excel' length 4177408 bytes (4.0 Mb)
    opened URL
    ==================================================
    downloaded 4.0 Mb
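So it looks like they changed the naming scheme for this one link. You can get that data by following the links on the page and then looking through the cruddy Sharepoint HTML for the download link.
This is why we hate Sharepoint, kiddies.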
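Looking through the HTML for the download link can be scripted as well. Below is a minimal sketch using only Python's standard library. The names `XlsLinkFinder` and `find_xls_links` are hypothetical, and it assumes the reports are ordinary `<a href="….xls">` anchors, which the answer does not confirm for the 2010 oil file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XlsLinkFinder(HTMLParser):
    """Collect absolute URLs of all <a> tags whose href ends in .xls."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore anchors without an href and anything that is not an .xls file.
        if href and href.lower().endswith(".xls"):
            self.links.append(urljoin(self.base_url, href))


def find_xls_links(html, base_url):
    """Return every .xls link in html, resolved against base_url."""
    finder = XlsLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Feeding it the downloaded page source and the page URL gives a list of candidate files, which can then be compared with the URLs that the pattern-based approach expects.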
With Python, the usual approach is to use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/).
Why doesn't download.file work? It works for me. – Spacedman
@Spacedman Can you show me how? Maybe I'm missing something. – agstudy