tarfile cannot open a TGZ file

I am trying to download a TGZ file from this website: https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07

Here is my script:

import os 
from six.moves import urllib 
import tarfile 

spam_path=os.path.join('ML', 'spam') 
root_download='https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07' 
spam_url=root_download+'255 MB Corpus (trec07p.tgz)' 

if not os.path.isdir(spam_path): 
    os.makedirs(spam_path) 

path=os.path.join(spam_path, 'trec07p.tgz') 
if not os.path.isfile('trec07p.tgz'): 
    urllib.request.urlretrieve(spam_url,path) 
tar_file=tarfile.open(path)

I don't know what I am missing, but the following error is raised:

--------------------------------------------------------------------------- 
ReadError         Traceback (most recent call last) 
<ipython-input-21-5644813e0670> in <module>() 
    18 if not os.path.isfile('trec07p.tgz'): 
    19  urllib.request.urlretrieve(spam_url,path) 
---> 20 tar_file=tarfile.open(path) 
    21 # tar_file.extractall(path) 
    22 # tar_file.close() 

/anaconda/lib/python2.7/tarfile.pyc in open(cls, name, mode, fileobj, bufsize, **kwargs) 
    1678       fileobj.seek(saved_pos) 
    1679      continue 
-> 1680    raise ReadError("file could not be opened successfully") 
    1681 
    1682   elif ":" in mode: 

ReadError: file could not be opened successfully

Thanks in advance for your help!

Asked 2017-10-09 by A.E

You can pass additional arguments to tarfile.open. You need to set the mode to 'r:gz'.

tarfile.open(path, 'r:gz')
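
As a minimal sketch of what the mode string does, the snippet below builds a tiny gzip-compressed archive in a temporary directory and reopens it with 'r:gz'. The file names (hello.txt, sample.tgz) are made up for illustration and have nothing to do with the corpus.

```python
import os
import tarfile
import tempfile

# Hypothetical file names, used only for this illustration.
workdir = tempfile.mkdtemp()
member_path = os.path.join(workdir, 'hello.txt')
archive_path = os.path.join(workdir, 'sample.tgz')

with open(member_path, 'w') as fh:
    fh.write('hello\n')

# 'w:gz' writes a gzip-compressed tar archive.
with tarfile.open(archive_path, 'w:gz') as tar:
    tar.add(member_path, arcname='hello.txt')

# 'r:gz' reads it back and insists on gzip compression.
with tarfile.open(archive_path, 'r:gz') as tar:
    names = tar.getnames()

print(names)
```

Note that plain tarfile.open(path) defaults to mode 'r', which is equivalent to 'r:*' and detects the compression transparently. An explicit 'r:gz' therefore mostly helps by giving a more specific error when the file is not gzip data at all.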

Working example, after accepting the agreement:

import tarfile 

import requests 

URL = 'https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/trec07p.tgz' 
FILE = '/home/blake/Downloads/trec07p.tgz' 

resp = requests.get(URL, stream=True) 
resp.raise_for_status() 

with open(FILE, 'wb') as out_file: 
    # Stream the response body to disk in 4 KiB chunks.
    for chunk in resp.iter_content(chunk_size=1024*4, decode_unicode=False):
        out_file.write(chunk)


f = tarfile.open(FILE, 'r:gz') 
print(f.getnames()) 

f.close()

Output:

['trec07p/data/inmail.35059', 
'trec07p/data/inmail.34430', 
'trec07p/data/inmail.45722', 
.. 
..]

Answered 2017-10-09 17:06:24 by blakev

I tried all the modes 'r', 'r:*', 'r:', 'r:gz' and 'r:bz2', but they all raise the same error –

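It doesn't work because the site does not let you download their corpus automatically; you have to click through the agreement first. Your URL is also bogus. See the edit. – blakev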
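
One way to confirm this diagnosis is to look at what was actually saved to disk. The sketch below is an illustration under stated assumptions, not part of the original answer: the helper name looks_like_gzip is hypothetical, and the "download" is simulated by writing a small HTML page under a .tgz name. It checks the two-byte gzip magic number (0x1f 0x8b) and then asks tarfile.is_tarfile for a second opinion.

```python
import os
import tarfile
import tempfile

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Hypothetical helper: True if the file starts with the gzip magic number."""
    with open(path, 'rb') as fh:
        return fh.read(2) == GZIP_MAGIC


# Simulate a failed download: an HTML page saved with a .tgz name.
workdir = tempfile.mkdtemp()
fake_path = os.path.join(workdir, 'trec07p.tgz')
with open(fake_path, 'wb') as fh:
    fh.write(b'<html><body>Please accept the agreement</body></html>')

is_gzip = looks_like_gzip(fake_path)      # the file is HTML, not gzip data
is_tar = tarfile.is_tarfile(fake_path)    # tarfile cannot read it either
print(is_gzip, is_tar)
```

If both checks fail on the real download, opening the saved file in a text editor or browser will usually show the page the server sent instead of the archive, and no mode string passed to tarfile.open can fix that.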
