Downloading PDFs from links scraped with Beautiful Soup

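I'm trying to write a script that iterates through a list of landing-page URLs in a CSV file, appends all the PDF links on each landing page to a list, and then iterates through that list, downloading each PDF to a specified folder.

I'm a bit stuck on the last step: I can get all the PDF URLs, but I can only download them individually. I'm not sure how best to change the file path for each URL so that every download gets its own unique file name.

Any help would be much appreciated!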

from bs4 import BeautifulSoup, SoupStrainer
import requests
import re

# example url
url = "https://beta.companieshouse.gov.uk/company/00445790/filing-history"
link_list = []
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

for a in soup.find_all('a', href=True):
    if "document" in a['href']:
        link_list.append("https://beta.companieshouse.gov.uk" + a['href'])

for url in link_list:
    response = requests.get(url)

    with open('C:/Users/Desktop/CompaniesHouse/report.pdf', 'wb') as f:
        f.write(response.content)

Source

2016-08-30 hlbau

The simplest thing is to use enumerate and just add a number to each file name:

for ind, url in enumerate(link_list, 1):
    response = requests.get(url)

    with open('C:/Users/Desktop/CompaniesHouse/report_{}.pdf'.format(ind), 'wb') as f:
        f.write(response.content)

But presuming each path ends in some_filename.pdf and those names are unique, you can use the basename itself, which may be more descriptive:

from os.path import basename, join

for url in link_list:
    response = requests.get(url)
    # basename(url) is the last path segment of the URL, e.g. some_filename.pdf
    with open(join('C:/Users/Desktop/CompaniesHouse', basename(url)), 'wb') as f:
        f.write(response.content)
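
The basename approach relies on every link ending in a distinct .pdf name. If some links carry a query string, share the same last segment, or do not end in .pdf at all, the two ideas above can be combined. The following is only a sketch under those assumptions: the helper name pdf_filename and the example.com URLs are made up for illustration, and the report_N.pdf fallback simply reuses the numbering scheme from the enumerate version.

```python
import posixpath
from urllib.parse import urlsplit


def pdf_filename(url, index, used):
    """Pick a file name for url, falling back to a numbered name."""
    # Drop any query string or fragment, then take the last path segment.
    # posixpath is used because URL paths always use forward slashes.
    name = posixpath.basename(urlsplit(url).path)
    # Fall back to report_<index>.pdf when the URL gives nothing usable
    # or an earlier link has already claimed the same name.
    if not name.lower().endswith(".pdf") or name in used:
        name = "report_{}.pdf".format(index)
    used.add(name)
    return name


# Hypothetical URLs, for illustration only.
example_urls = [
    "https://example.com/files/accounts-2015.pdf",
    "https://example.com/files/accounts-2015.pdf?download=1",
    "https://example.com/filing/abc123/document?format=pdf",
]
used = set()
names = [pdf_filename(u, i, used) for i, u in enumerate(example_urls, 1)]
print(names)
```

In the download loop, the returned name would take the place of basename(url) in the join(...) call. The fallback could still clash if a site really serves files named report_N.pdf, so treat this as a starting point rather than a guarantee of uniqueness.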

Source

2016-08-30 21:28:57
