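I am trying to scrape '.xlsx' files from the Tax Foundation website. Unfortunately, I keep getting this error message: Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
I did some research, and it said the way to fix this is to change the file extension to '.xls' instead of '.xlsx'. Can anyone help? How do I change the file extension?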
from bs4 import BeautifulSoup
import urllib.request
import os

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")
soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))
FHFA = os.chdir('C:/US_Census/Directory')
seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue
    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
        print(filename)
        print(' ')
print("All files successfully downloaded.")
P.S. I know you can download this file manually, but I am scraping it to automate a particular process.
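Renaming a file only changes its name, not its contents, so it may help to first check what was actually saved. An '.xlsx' workbook is a ZIP archive that contains a part named `[Content_Types].xml`. The sketch below tests for that; `looks_like_xlsx` is a hypothetical helper name, not part of any library, and the check is only a rough heuristic under that assumption.

```python
import io
import zipfile


def looks_like_xlsx(data: bytes) -> bool:
    """Return True if `data` is a ZIP archive holding an OOXML content-types part.

    Hypothetical helper for illustration; a heuristic, not a full validation.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        return "[Content_Types].xml" in archive.namelist()


if __name__ == "__main__":
    # An HTML page saved under an .xlsx name is still HTML, not a workbook.
    print(looks_like_xlsx(b"<!DOCTYPE html><html><body>Not Found</body></html>"))
```

If a downloaded file fails this check, opening it in a text editor should show what the server really returned, for example an HTML error page.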
What version of Python are you using? – TheDetective
The expression `any(href.endswith(x) for x in ['.xlsx'])` loops exactly once over `['.xlsx']` and then checks `href.endswith('.xlsx')`. You could shorten it to the simpler `if not href.endswith('.xlsx')`. – Vinny
I am using Python 3.6 @TheDetective – bhammer
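A possible sketch of the simplified filter from the comment above, combined with `urllib.parse.urljoin` for building the download URL. This assumes (it is not confirmed in the thread) that some hrefs on the page may already be absolute URLs, in which case prepending 'https://taxfoundation.org/' would produce a malformed address. `xlsx_download_urls` and the sample hrefs are hypothetical names and values for illustration only.

```python
from urllib.parse import urljoin

PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def xlsx_download_urls(hrefs, page_url=PAGE_URL):
    """Return de-duplicated absolute URLs for the hrefs that end in .xlsx."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href.endswith(".xlsx"):
            continue
        # urljoin keeps an absolute href as-is and resolves a relative one
        # against the page URL.
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    # Hypothetical hrefs, not taken from the real page.
    sample = [
        "https://files.example.org/report.xlsx",
        "/downloads/table.xlsx",
        "/about/",
        "/downloads/table.xlsx",
    ]
    for url in xlsx_download_urls(sample):
        print(url)
```

Each resulting URL could then be passed to `urllib.request.urlretrieve` in place of the concatenated string.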