2015-09-14

wget - how to skip files that are not found?

I use wget to download files from the internet, with the -O option to save each image under a custom filename. Sometimes the file is not found and a 404 error code is returned. For example, I run the following command:

wget 'http://www.example.com/path/to/image/file01928.jpg' -O myimagefile.jpg 

The result is:

[email protected]:~# wget 'http://www.example.com/path/to/image/file01928.jpg' -O myimagefile.jpg 
--2015-09-13 23:11:07-- http://www.example.com/path/to/image/file01928.jpg 
Resolving www.example.com (www.example.com)... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946 
Connecting to www.example.com (www.example.com)|93.184.216.34|:80... connected. 
HTTP request sent, awaiting response... 404 Not Found 
2015-09-13 23:11:07 ERROR 404: Not Found. 

Even though the file was not found, a file is still saved to my hard disk:

[email protected]:~# ls 
myimagefile.jpg 

Is there a way to skip/cancel (not run the command for) files that are not found? Which options should I use?


Do you want wget not to run if the local file already exists? – xxfelixxx


@xxfelixxx I mean that I have a list of URLs for hundreds of images that need to be downloaded. The list was created 4 months ago. Several of those image URLs no longer exist (expired domains, deleted image files, etc.). I don't want to download the "not found" files; only valid files/URLs should be downloaded. Is there a way to do this? – user3195859

Answer


You can perform a HEAD request to check whether the resource (the image) exists and, if it does, download it. With wget you can use -S to print the server response headers and --spider to check for the resource without downloading it.

man wget

-S 
    --server-response 
     Print the headers sent by HTTP servers and responses sent by FTP servers. 

    --spider 
     When invoked with this option, Wget will behave as a Web spider, which means that 
     it will not download the pages, just check that they are there. For example, you 
     can use Wget to check your bookmarks: 

       wget --spider --force-html -i bookmarks.html 

     This feature needs much more work for Wget to get close to the functionality of 
     real web spiders. 

Here is an example:

#!/bin/bash 

URL='http://www.google.com' 
echo "Checking $URL" 
if wget -S --spider "$URL" 2>&1 | grep -q 'Remote file exists'; then 
    echo "Found $URL, going to fetch it" 
    wget "$URL" -O google.html 
else 
    echo "Url $URL does not exist!" 
fi 

URL='http://www.example.com/path/to/image/file01928.jpg' 
echo "Checking $URL" 
if wget -S --spider "$URL" 2>&1 | grep -q 'Remote file exists'; then 
    echo "Found $URL, going to fetch it" 
    wget "$URL" -O myimagefile.jpg 
else 
    echo "Url $URL does not exist!" 
fi 

Output

Checking http://www.google.com 
Found http://www.google.com, going to fetch it 
--2015-09-14 05:26:34-- http://www.google.com/ 
Resolving www.google.com (www.google.com)... 74.125.239.144, 74.125.239.145, 74.125.239.146, ... 
Connecting to www.google.com (www.google.com)|74.125.239.144|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: unspecified [text/html] 
Saving to: ‘google.html’ 

    [ <=>             ] 18,684  --.-K/s in 0.001s 

2015-09-14 05:26:34 (13.9 MB/s) - ‘google.html’ saved [18684] 

Checking http://www.example.com/path/to/image/file01928.jpg 
Url http://www.example.com/path/to/image/file01928.jpg does not exist! 
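Since the question mentions a list of hundreds of image URLs, the check above can be wrapped in a loop. The sketch below assumes the URLs sit one per line in a file called urls.txt (a hypothetical filename; the question does not name the list file) and that each image should be saved under the last component of its URL path. As an extra safeguard, the output file is removed whenever wget exits with a non-zero status (as it does on a 404), so no empty file is left behind even if a download slips past the spider check.

```shell
#!/bin/bash
# Assumption: urls.txt (hypothetical) holds one image URL per line.
while IFS= read -r URL; do
    [ -z "$URL" ] && continue           # skip blank lines
    OUT="${URL##*/}"                    # save under the last path component, e.g. file01928.jpg
    echo "Checking $URL"
    if wget -S --spider "$URL" 2>&1 | grep -q 'Remote file exists'; then
        echo "Found $URL, going to fetch it"
        # if the download still fails, delete the partial output file
        wget "$URL" -O "$OUT" || rm -f "$OUT"
    else
        echo "Url $URL does not exist, skipping"
    fi
done < urls.txt
```

The `|| rm -f "$OUT"` part works because wget returns a non-zero exit status on server errors such as 404, so the cleanup runs only when the download did not succeed.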