How to download .qrs files from a website using Python and BeautifulSoup?

I would like to download all the files ending in .qrs, .dat and .hea from the website below and store them in a local folder.

https://physionet.org/physiobank/database/shareedb/

I tried to adapt the solution from the following link.

https://stackoverflow.com/questions/34632838/download-xls-files-from-a-webpage-using-python-and-beautifulsoup

Here is my modified code:

import os 
from bs4 import BeautifulSoup 
# Python 3.x 
from urllib.request import urlopen, urlretrieve 

URL = 'https://physionet.org/physiobank/database/shareedb/' 
OUTPUT_DIR = '' # path to output folder, '.' or '' uses current folder 

u = urlopen(URL) 
try: 
    html = u.read().decode('utf-8') 
finally: 
    u.close() 

soup = BeautifulSoup(html, "html.parser") 
for link in soup.select('a[href^="https://"]'): # or a[href*="shareedb/0"] 
    href = link.get('href') 
    if not any(href.endswith(x) for x in ['.dat','.hea','.qrs']): 
        continue 

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1]) 

    # We need a https:// URL for this site 
    # href = href.replace('http://','https://') 

    print("Downloading %s to %s..." % (href, filename)) 
    urlretrieve(href, filename) 
    print("Done.")

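When I run this code, it does not fetch any files from the target page, and it does not print any failure message either (e.g. 'failed to download').

After some debugging I saw that in my case none of the files are being selected. I suspect it has more to do with the structure of the HTML.

How can I download these files to a local directory using Python?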

Source

2016-12-15 Molnia

You can use the excellent requests library as follows:

import bs4
import requests

url = "https://physionet.org/physiobank/database/shareedb/"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']

    # Only follow links that point at the wanted file types
    if any(href.endswith(x) for x in ['.dat', '.hea', '.qrs']):
        print("Downloading '{}'".format(href))
        # The hrefs on this page are relative, so prepend the page URL
        remote_file = requests.get(url + href)

        # Write the body to a local file of the same name, chunk by chunk
        with open(href, 'wb') as f:
            for chunk in remote_file.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)

This downloads all of the .dat, .hea and .qrs files to your computer.

Install it in the standard way:

pip install requests

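Note that all of the hrefs on that page are already in a form that can be used directly as a filename (so at the moment there is no need to strip out any / characters).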
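The chunked write loop in this answer can be pulled into a small helper and tried without touching the network. This is only a minimal sketch: save_chunks and the demo path are hypothetical names (not part of requests), and a plain list of byte strings stands in for what remote_file.iter_content(...) would yield.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks, as the answer's loop does
                f.write(chunk)
                written += len(chunk)
    return written


# Demo: in-memory chunks stand in for response.iter_content(chunk_size=...)
demo_path = os.path.join(tempfile.mkdtemp(), '01911.hea')
demo_written = save_chunks([b'abc', b'', b'de'], demo_path)
print("Wrote {} bytes to {}".format(demo_written, demo_path))
```

With a real response, the same helper would presumably be called as save_chunks(remote_file.iter_content(chunk_size=1024), href).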

2016-12-15 09:37:44

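I also tried your solution and it works fine. Can you explain why it takes so long to download these files? – Molnia

It could be a server-side limitation. –

Thanks for the clean code. Although @Teemu Risikko gave a very good response, your solution takes a different approach and is slightly more efficient, since it downloads the files in less time. Could you tell me, or guess, why it is faster even though you are using nested loops? – Molnia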

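Judging from your symptoms, the likely cause is that no URL matches, so the loop body is never entered. I use Python 2.7, so I have not verified the code. You can try printing the links you match, and then check whether those URLs can actually be downloaded.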
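The check suggested here can be tried offline. The sketch below is an illustration only: SAMPLE_HTML is a hypothetical snippet that imitates the relative links described in this thread, not the real page. It prints what the selector from the question matches next to every href found in the document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that imitates a directory listing with relative links
SAMPLE_HTML = """
<pre>
<a href="../">Parent Directory</a>
<a href="01911.dat">01911.dat</a>
<a href="01911.hea">01911.hea</a>
<a href="01911.qrs">01911.qrs</a>
<a href="RECORDS">RECORDS</a>
</pre>
"""

EXTENSIONS = ('.dat', '.hea', '.qrs')

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector from the question: only hrefs that start with "https://"
absolute_links = soup.select('a[href^="https://"]')
print("Matched by the question's selector:", len(absolute_links))

# Every anchor that carries an href, whatever its form
all_hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print("All hrefs:", all_hrefs)

# str.endswith accepts a tuple of suffixes
wanted = [h for h in all_hrefs if h.endswith(EXTENSIONS)]
print("Wanted files:", wanted)
```

If the first count is zero while the last list is not empty, the selector is the problem rather than the download step.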

2016-12-15 09:16:15

That is exactly the case: the condition if not any(href.endswith(x) for x in ['.dat', '.hea', '.qrs']): is never satisfied, and the loop just continues back to the start. – Molnia

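To expand on 狼天's answer: the select finds nothing because the links on that site do not have "https://" in their href (nor "shareedb"). All the files you are trying to download have the structure <a href="01911.hea">01911.hea</a>. Their paths are relative. So what you need to do is first extract the filename, for example like this: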

for link in soup.select('a'): 
    href = link.get('href') 
    if not href or not any(href.endswith(x) for x in ['.dat','.hea','.qrs']): 
        continue 

    filename = os.path.join(OUTPUT_DIR, href)

Then, before retrieving, you need to prepend the host part to the href:

urlretrieve(URL + href, filename)
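As a variation on the two steps above, urllib.parse.urljoin can build the full URL instead of string concatenation, which also copes with an href that happens to be absolute already. This is a minimal sketch: build_targets, the downloads folder and the sample hrefs are hypothetical, and nothing is fetched here.

```python
import os
from urllib.parse import urljoin

URL = 'https://physionet.org/physiobank/database/shareedb/'
OUTPUT_DIR = 'downloads'  # hypothetical output folder
EXTENSIONS = ('.dat', '.hea', '.qrs')


def build_targets(base_url, hrefs, output_dir):
    """Return (full_url, local_filename) pairs for the wanted file types."""
    targets = []
    for href in hrefs:
        # Skip anchors without an href and links to other file types
        if not href or not href.endswith(EXTENSIONS):
            continue
        full_url = urljoin(base_url, href)  # resolves relative hrefs
        # Keep only the last path component as the local name
        filename = os.path.join(output_dir, href.rsplit('/', 1)[-1])
        targets.append((full_url, filename))
    return targets


sample_hrefs = ['../', None, '01911.hea', URL + '01911.qrs']
targets = build_targets(URL, sample_hrefs, OUTPUT_DIR)
for full_url, filename in targets:
    print("{} -> {}".format(full_url, filename))
```

Each pair could then be passed to urlretrieve(full_url, filename) once the output folder exists.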

2016-12-15 09:33:44

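With this approach the files do get downloaded, but very slowly, and I also had to add if href is not None: – Molnia

The if part should be fine now. For example, the size of 01911.dat seems to be "only" 1.9 MiB, but it also takes a long time when I try to open it directly from the browser. –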

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = 'https://physionet.org/physiobank/database/shareedb/'
r = requests.get(start_url)
soup = BeautifulSoup(r.text, 'lxml')  # this parser requires the lxml package

# get the full url of each file listed inside the <pre> block
pre = soup.find('pre')
file_urls = pre.select('a[href*="."]')
full_urls = [urljoin(start_url, url['href']) for url in file_urls]
# download each file in chunks
for full_url in full_urls:
    file_name = full_url.split('/')[-1]
    print("Downloading {} to {}...".format(full_url, file_name))
    with open(file_name, 'wb') as f:
        fr = requests.get(full_url, stream=True)
        for chunk in fr.iter_content(chunk_size=1024):
            f.write(chunk)
    print('Done')

Output:

Downloading https://physionet.org/physiobank/database/shareedb/01911.dat to 01911.dat... 
Done 
Downloading https://physionet.org/physiobank/database/shareedb/01911.hea to 01911.hea... 
Done 
Downloading https://physionet.org/physiobank/database/shareedb/01911.qrs to 01911.qrs... 
Done 
Downloading https://physionet.org/physiobank/database/shareedb/02012.dat to 02012.dat... 
Done 
Downloading https://physionet.org/physiobank/database/shareedb/02012.hea to 02012.hea... 
Done 
Downloading https://physionet.org/physiobank/database/shareedb/02012.qrs to 02012.qrs...

2016-12-15 10:27:12

Thank you for your answer. – Molnia
