2016-08-30

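Downloading PDFs from links scraped with Beautiful Soup

I am trying to write a script that iterates through a list of landing-page URLs from a CSV file, appends every PDF link found on each landing page to a list, and then iterates through that list, downloading each PDF to a specified folder.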

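I am a bit stuck on the last step: I can get all the PDF URLs, but I can only download them one at a time. I am not sure how best to change the file path for each URL so that every download gets its own unique file name.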

Any help would be much appreciated!

from bs4 import BeautifulSoup, SoupStrainer 
import requests 
import re 

#example url 
url = "https://beta.companieshouse.gov.uk/company/00445790/filing-history" 
link_list = [] 
r = requests.get(url) 
soup = BeautifulSoup(r.content, "lxml") 

for a in soup.find_all('a', href=True):
    if "document" in a['href']:
        link_list.append("https://beta.companieshouse.gov.uk" + a['href'])

for url in link_list: 

    response = requests.get(url) 

    with open('C:/Users/Desktop/CompaniesHouse/report.pdf', 'wb') as f:
        f.write(response.content)

Answer

The simplest thing is to use enumerate to add a number to each file name:

for ind, url in enumerate(link_list, 1): 
    response = requests.get(url) 

    with open('C:/Users/Desktop/CompaniesHouse/report_{}.pdf'.format(ind), 'wb') as f:
        f.write(response.content)
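
If the list is long, zero-padding the counter keeps the files in download order when the folder is sorted by name. The following is only a sketch of that idea: `numbered_pdf_paths` and its `prefix` parameter are hypothetical names, not part of the original script, and it only builds paths without touching the network.

```python
import os


def numbered_pdf_paths(urls, folder, prefix="report"):
    """Pair each URL with a unique, zero-padded .pdf path inside folder.

    Hypothetical helper for illustration only.
    """
    # Pad to the number of digits in the list length, e.g. 2 digits for 10-99 URLs.
    width = len(str(len(urls)))
    return [
        (url, os.path.join(folder, "{}_{:0{}d}.pdf".format(prefix, i, width)))
        for i, url in enumerate(urls, 1)
    ]
```

In the download loop this would replace the hard-coded path: iterate with `for url, path in numbered_pdf_paths(link_list, folder):` and pass `path` to `open(path, 'wb')`.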

But assuming each path ends in some_filename.pdf and the names are unique, you can use the basename itself, which may be more descriptive:

from os.path import basename, join

for url in link_list:
    response = requests.get(url)
    # basename(url) is the text after the last "/" in the URL
    with open(join('C:/Users/Desktop/CompaniesHouse', basename(url)), 'wb') as f:
        f.write(response.content)
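
The basename approach only works when the assumption above holds. If a link carries a query string (say a hypothetical `.../document?format=pdf`), `basename(url)` would include the `?`, which Windows does not allow in file names. A minimal sketch of a more defensive variant, assuming Python 3 and using a hypothetical helper `filename_from_url` that falls back to a caller-supplied name:

```python
import os
import re
from urllib.parse import urlsplit, unquote


def filename_from_url(url, fallback):
    """Return a filesystem-safe .pdf name taken from the URL path, else fallback.

    Hypothetical helper for illustration only.
    """
    # urlsplit(...).path excludes the ?query and #fragment parts of the URL.
    path = unquote(urlsplit(url).path)
    name = os.path.basename(path.rstrip("/"))
    # Replace characters that Windows rejects in file names.
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    if not name.lower().endswith(".pdf"):
        return fallback
    return name
```

A natural fallback is the enumerated name from the first snippet, for example `filename_from_url(url, 'report_{}.pdf'.format(ind))`. This sketch does not check for two URLs that share the same basename, so uniqueness is still an assumption.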