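Mirror an entire website and save the links in a txt file

Is it possible to use wget mirroring to collect all the links of an entire website and save them in a txt file?

If it is possible, how is it done? If not, is there another way to do this?

Edit:

I tried to run this command: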

wget -r --spider example.com

and got this result:

Spider mode enabled. Check if remote file exists. 
--2015-10-03 21:11:54-- http://example.com/ 
Resolving example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946 
Connecting to example.com|93.184.216.34|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 1270 (1.2K) [text/html] 
Remote file exists and could contain links to other resources -- retrieving. 

--2015-10-03 21:11:54-- http://example.com/ 
Reusing existing connection to example.com:80. 
HTTP request sent, awaiting response... 200 OK 
Length: 1270 (1.2K) [text/html] 
Saving to: 'example.com/index.html' 

100%[=====================================================================================================>] 1,270  --.-K/s in 0s  

2015-10-03 21:11:54 (93.2 MB/s) - 'example.com/index.html' saved [1270/1270] 

Removing example.com/index.html. 

Found no broken links. 

FINISHED --2015-10-03 21:11:54-- 
Total wall clock time: 0.3s 
Downloaded: 1 files, 1.2K in 0s (93.2 MB/s) 

(Yes, I also tried using other websites with more internal links)

Source

2015-10-03 user1878980

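Yes, that is how it is supposed to work. The actual site "example.com" has no internal links, so it just returns itself. Try a site that links to other pages within the same site and you should get more. Do you also want links to *external* sites? If so, the Python script from @Randomazer might be a better option. – seumasmac

Actually, there is a similar question at http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only which might be useful. – seumasmac

Thank you very much! That helped! – user1878980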

Yes, use wget's --spider option. A command like:

wget -r --spider example.com

will follow all links down to a depth of 5 (the default). You can then capture the output in a file, perhaps cleaning it up as you go. Something like:

wget -r --spider example.com 2>&1 | grep "http://" | cut -f 4 -d " " >> weblinks.txt

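will put just the links into the weblinks.txt file (if your version of wget produces slightly different output, you may need to tweak that command a little).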
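The same filtering can also be sketched in Python instead of grep and cut. This is a minimal sketch, assuming wget's request lines look like the ones quoted in the question (`--<date> <time>--  <URL>`); the name `extract_urls` and the sample log are made up for illustration.

```python
import re

# Matches wget request lines such as:
#   --2015-10-03 21:11:54--  http://example.com/
# Assumption: the log format matches the output quoted in the question.
REQUEST_LINE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")


def extract_urls(log_text):
    """Return the URLs wget requested, in first-seen order, without repeats."""
    seen = set()
    urls = []
    for line in log_text.splitlines():
        match = REQUEST_LINE.match(line.strip())
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            urls.append(match.group(1))
    return urls


# Illustrative input: a shortened copy of the log from the question.
sample_log = """Spider mode enabled. Check if remote file exists.
--2015-10-03 21:11:54--  http://example.com/
Resolving example.com... 93.184.216.34
--2015-10-03 21:11:54--  http://example.com/
Reusing existing connection to example.com:80.
"""

for url in extract_urls(sample_log):
    print(url)
```

Unlike the shell pipeline, this version matches on the timestamp prefix rather than on a field position, so it does not depend on how many spaces separate the fields, and it drops repeated URLs.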

Source

2015-10-03 18:37:50 seumasmac

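OK, thanks. I tried to copy the script you wrote, but it did not work. It created a weblinks.txt file, but it only saved http://www.example.com in it (I also tried entering other websites). Maybe I need to tweak it; the problem is that I don't know how. – user1878980

Can you run the first command and see what output it gives? Note that the only way it can discover other pages is by following the links on the page you give it. If that page has no links to any other pages, it will not find anything else. – seumasmac

It is hard to add details in these comments, so you may find it easier to update your question with the details of what you have tried. – seumasmac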
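The point made in the comments, that other pages can only be discovered by following links from the start page, can be illustrated with a small breadth-first crawler. This is a rough sketch under stated assumptions, not how wget itself is implemented: the names `LinkParser`, `fetch` and `crawl` are hypothetical, only `<a href>` links on the same host are followed, and the `max_depth=5` default is chosen to echo the depth mentioned in the answer.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (needs network access; no error handling)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_depth=5):
    """Breadth-first crawl of same-host pages reachable from start_url."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not fetch pages at the depth limit
        parser = LinkParser()
        parser.feed(fetch_page(url))
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)


# Hypothetical usage (performs real HTTP requests):
#   with open("weblinks.txt", "w") as out:
#       out.write("\n".join(crawl("http://example.com/")) + "\n")
```

A page that contains no links to other pages on the same host yields only the start URL, which matches what the question observed for example.com.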

Or use Python:

For example:

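import re
import urllib.request


def do_page(url):
    # Fetch the page and decode it to text.
    # (Python 3; the original post used Python 2's urllib.urlopen.)
    with urllib.request.urlopen(url) as f:
        html = f.read().decode('utf-8', errors='replace')
    # Match single-quoted links under the start URL that end in .html.
    # The URL is escaped and the match cannot cross a quote,
    # so each hit is exactly one link.
    pattern = r"'{}[^']*\.html'".format(re.escape(url))
    return re.findall(pattern, html)


if __name__ == '__main__':
    url = 'http://thehackernews.com/'
    hits = do_page(url)
    # Write one link per line.
    with open('links.txt', 'w') as f1:
        for hit in hits:
            f1.write(hit + '\n')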

Output:

'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/adblock-extension.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/data-breach-hacking.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/experian-tmobile-hack.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/p/authors.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/09/digital-india-facebook.html' 
'http://thehackernews.com/2015/09/digital-india-facebook.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/10/buy-google-domain.html' 
'http://thehackernews.com/2015/09/winrar-vulnerability.html' 
'http://thehackernews.com/2015/09/winrar-vulnerability.html' 
'http://thehackernews.com/2015/09/chip-mini-computer.html' 
'http://thehackernews.com/2015/09/chip-mini-computer.html' 
'http://thehackernews.com/2015/09/edward-snowden-twitter.html' 
'http://thehackernews.com/2015/09/edward-snowden-twitter.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html' 
'http://thehackernews.com/2015/09/quantum-teleportation-data.html' 
'http://thehackernews.com/2015/09/quantum-teleportation-data.html' 
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html' 
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html' 
'http://thehackernews.com/2015/09/xor-ddos-attack.html' 
'http://thehackernews.com/2015/09/xor-ddos-attack.html' 
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html' 
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'
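The listing above repeats many links, because the same URL occurs several times in the page. If only distinct links are wanted, the hits can be de-duplicated before they are written out. A minimal sketch follows; the helper name `unique_in_order` is made up, and the `hits` list is a short sample taken from the output above.

```python
def unique_in_order(items):
    """Drop repeated entries while keeping the first-seen order."""
    # dict keys keep insertion order (Python 3.7+), so they act as an ordered set.
    return list(dict.fromkeys(items))


# Sample data: the first few lines of the output shown above.
hits = [
    "'http://thehackernews.com/2015/10/adblock-extension.html'",
    "'http://thehackernews.com/p/authors.html'",
    "'http://thehackernews.com/2015/10/adblock-extension.html'",
    "'http://thehackernews.com/2015/10/adblock-extension.html'",
]

for hit in unique_in_order(hits):
    print(hit)
```

Using `set(hits)` would also remove the repeats, but it would not keep the order in which the links appear on the page.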

Source

2015-10-03 19:01:59 Randomazer
