How do I change a file extension?

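I want to scrape '.xlsx' files from the Tax Foundation website. Unfortunately, I keep getting this error message: Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file. I did some research, and it suggested that the fix is to change the file extension to '.xls' instead of '.xlsx'. Can anyone help? How do I change the file extension?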

from bs4 import BeautifulSoup 
import urllib.request 
import os 

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/") 

soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset')) 

FHFA = os.chdir('C:/US_Census/Directory') 

seen = set() 
for link in soup.find_all('a', href=True): 
    href = link.get('href') 
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue

    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
    print(filename) 

print(' ') 
print("All files successfully downloaded.")

P.S. I know you can download this file manually, but I am scraping it to automate a specific process.

Source

2017-08-04 bhammer

What version of Python are you using? – TheDetective

The expression any(href.endswith(x) for x in ['.xlsx']) loops exactly once over ['.xlsx'] and then checks href.endswith('.xlsx'). You can shorten it to the simpler if not href.endswith('.xlsx'). – Vinny
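To illustrate Vinny's point, here is a minimal sketch showing that the two checks agree. The helper names is_xlsx_link and is_excel_link are made up for this example and do not appear in the original script. The second helper relies on the fact that str.endswith also accepts a tuple of suffixes, which may be handy if more than one extension ever needs to match.

```python
def is_xlsx_link(href):
    """Return True when the link target ends in '.xlsx' (hypothetical helper)."""
    return href.endswith('.xlsx')


def is_excel_link(href, extensions=('.xlsx', '.xls')):
    """Return True when the link ends in any of the given suffixes.

    str.endswith accepts a tuple, so no any(...) loop is needed.
    """
    return href.endswith(extensions)


link = "https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx"

# The original any(...) form and the shortened form give the same answer.
original_check = any(link.endswith(x) for x in ['.xlsx'])
print(original_check == is_xlsx_link(link))
```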

I'm using Python 3.6 @TheDetective – bhammer

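Your problem is the line url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file). If you visit the site and hover over the Excel download button, you will see that the link is longer: https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx (note the 2017....238?). The href is already a complete URL, so prepending 'https://taxfoundation.org/' produces the wrong address and you never actually download the Excel file. Changing the extension will not help. Here is the correct line: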

url = urllib.request.urlretrieve(href, file)

With that change, everything works.
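If you are not sure whether every link on a page is absolute, one possible approach is to let urllib.parse.urljoin resolve each href against the page URL. This is only a sketch: resolve_download_url is a name invented for the example, and the relative path "/files/example.xlsx" is a made-up input used to show the behaviour, not a link that exists on the site.

```python
from urllib.parse import urljoin

PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_download_url(page_url, href):
    """Resolve href against the page it was found on (hypothetical helper).

    urljoin returns an absolute href unchanged and completes a relative
    one using the scheme and host of page_url.
    """
    return urljoin(page_url, href)


# The real link from the page is already absolute, so it comes back as-is.
absolute_href = "https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx"
print(resolve_download_url(PAGE_URL, absolute_href))

# A made-up relative link is completed with the page's scheme and host.
print(resolve_download_url(PAGE_URL, "/files/example.xlsx"))
```

In the original loop, the result could then be passed to urllib.request.urlretrieve in place of the hand-built string.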

Source

2017-08-04 13:53:06 TheDetective

Awesome, that fixed it, thanks! – bhammer

You're welcome! I also tried Vinny's suggestion, and it still works. – TheDetective
