Remove duplicate URLs from a jSoup Elements list?

When scraping a page with jSoup, all the links on the page can be collected using:

Elements allLinksOnPage = doc.select("a");

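That works well. Now, how would one go about removing duplicate URLs from this list? For example, imagine a /contact-us.html that is linked from the main navigation on every page.

Once all the duplicate URLs have been removed, the next step is to crawl those unique URLs and continue the loop.

A question about the practicalities of this. The code so far: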

for (Element e : allLinksOnPage) {
    String absUrl = e.absUrl("href");

    // Absolute URL starts with HTTP or HTTPS to make sure it isn't a mailto: link
    if (absUrl.startsWith("http") || absUrl.startsWith("https")) {
        // Check that the URL starts with the original domain name
        if (absUrl.startsWith(getURL)) {
            // Remove duplicate URLs
            // Not sure how to do this bit yet?
            // Add new URLs found on the page to 'allLinksOnPage' to allow this
            // for loop to continue until the entire website has been scraped
        }
    }
}

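So the problem lies in the final part of the loop. Imagine that when page-2.html is crawled, more URLs are identified there and added to the allLinksOnPage variable.

Will the for loop then run for the length of the full list, i.e. 10 links on page-1.html plus 10 links on page-2.html, so that 20 pages are crawled in total? Or will the loop only run for the length of the first 10 links identified, i.e. the links that existed before the code 'for (Element e : allLinksOnPage)' was triggered?

This will all inevitably end up in a database once the logic is finalised, but I would like to keep the logic purely Java-based at first, to avoid a large number of reads/writes to the database, which would slow everything down.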


2016-12-11 Michael Cropper

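You could use a Set to store the URLs, then check whether each URL has already been processed. By the way, 'absUrl.startsWith("http") || absUrl.startsWith("https")' is redundant: anything that starts with "https" also starts with "http", so you can drop the 'startsWith("https")' part.

Thanks @MadMatts, yes, that is correct.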
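The Set suggestion from the comment above could look roughly like the following for a single page. This is only a minimal sketch: plain strings stand in for the values returned by e.absUrl("href"), and the class name LinkDeduper, the method uniqueSiteLinks and the example.com URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of jSoup.
class LinkDeduper {
    // Returns each URL under sitePrefix once, in order of first appearance.
    static List<String> uniqueSiteLinks(List<String> absUrls, String sitePrefix) {
        Set<String> seen = new LinkedHashSet<>();
        for (String absUrl : absUrls) {
            // "https..." also starts with "http", so one check covers both schemes
            if (absUrl.startsWith("http") && absUrl.startsWith(sitePrefix)) {
                seen.add(absUrl); // does nothing if the URL is already in the set
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
                "http://example.com/contact-us.html",
                "http://example.com/",
                "http://example.com/contact-us.html", // duplicate from the navigation
                "mailto:someone@example.com",
                "http://other.example.org/");
        System.out.println(uniqueSiteLinks(found, "http://example.com"));
    }
}
```

A LinkedHashSet is used here only so that the links keep the order in which they were found; a HashSet would work as well if the order does not matter.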

allLinksOnPage is iterated only once. You never retrieve any information about the pages that the links you found point to.
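As for adding newly found links to the list while the for-each loop is still running over it: with a standard ArrayList this does not extend the loop, it fails. The sketch below uses a plain ArrayList of strings as a stand-in, on the assumption that the jSoup Elements list behaves like an ArrayList in this respect; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demo class; a plain ArrayList stands in for the Elements list.
class ForEachGrowthDemo {
    // Returns true if appending to the list inside a for-each loop over it fails.
    static boolean addingDuringForEachFails() {
        List<String> links = new ArrayList<>(Arrays.asList("/page-1.html", "/page-2.html"));
        try {
            for (String link : links) {
                // Structural change while the loop's iterator is still active
                links.add(link + "?found-later");
            }
            return false; // the loop completed without complaint
        } catch (ConcurrentModificationException ex) {
            return true; // the iterator noticed the list changed underneath it
        }
    }

    public static void main(String[] args) {
        System.out.println("Adding during for-each fails: " + addingDuringForEachFails());
    }
}
```

So neither "20 pages" nor "10 pages" describes what happens; pages still to be crawled need to live in a separate collection that is consumed outside the for-each loop.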

However, you could use a Set together with a List. Furthermore, you could use the URL class to extract the protocol for you.

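import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

URL startUrl = ...;
Set<String> addedPages = new HashSet<>(); // every URL ever queued, used for the duplicate check
List<URL> urls = new ArrayList<>();       // URLs queued but not yet crawled
addedPages.add(startUrl.toExternalForm());
urls.add(startUrl);
while (!urls.isEmpty()) {
    // retrieve url not yet crawled
    URL url = urls.remove(urls.size() - 1);

    Document doc = Jsoup.parse(url, TIMEOUT); // throws IOException if the page cannot be fetched
    Elements allLinksOnPage = doc.select("a");
    for (Element e : allLinksOnPage) {
        // add hrefs
        URL absUrl;
        try {
            absUrl = new URL(e.absUrl("href"));
        } catch (MalformedURLException ex) {
            continue; // absUrl() returned "" or something that is not a valid URL
        }

        switch (absUrl.getProtocol()) {
            case "https":
            case "http":
                if (absUrl.toExternalForm().startsWith(getURL) && addedPages.add(absUrl.toExternalForm())) {
                    // add url, if not already added
                    urls.add(absUrl);
                }
        }
    }
}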
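To try the Set-plus-List logic without touching the network, the page fetch can be swapped for a lookup in an in-memory map. The following is a sketch under that assumption only: the class WorklistCrawlSketch, its methods and the sample site are invented for illustration, and the map lookup takes the place of Jsoup.parse plus doc.select("a").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the crawler: pages and links are plain strings.
class WorklistCrawlSketch {
    // Visits every page reachable from start exactly once and returns the visit order.
    static List<String> crawl(String start, Map<String, List<String>> linksByPage) {
        Set<String> addedPages = new HashSet<>(); // every page ever queued
        List<String> urls = new ArrayList<>();    // pages queued but not yet crawled
        List<String> visited = new ArrayList<>();
        addedPages.add(start);
        urls.add(start);
        while (!urls.isEmpty()) {
            String page = urls.remove(urls.size() - 1);
            visited.add(page);
            // In the real crawler this lookup would be the page download
            for (String link : linksByPage.getOrDefault(page, Collections.<String>emptyList())) {
                if (addedPages.add(link)) { // add() returns false for duplicates
                    urls.add(link);
                }
            }
        }
        return visited;
    }

    // A tiny made-up site where /contact-us.html is linked from several pages.
    static Map<String, List<String>> sampleSite() {
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/a", "/contact-us.html", "/a"));
        site.put("/a", Arrays.asList("/", "/contact-us.html", "/b"));
        return site;
    }

    public static void main(String[] args) {
        System.out.println(crawl("/", sampleSite()));
    }
}
```

Because the loop runs until the list of pending pages is empty, pages discovered on page 2, page 3 and so on are all picked up, while the Set ensures that no page is queued twice.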


2016-12-11 14:44:56 fabian

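This works perfectly, thanks @fabian. I have just tested it on various websites and noticed that many sites link to URLs containing a #, which are just duplicates of the same URL. I will open a new question about how to normalise URLs, as that is beyond the scope of this question.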
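For the # case mentioned in the comment above, one simple option is to cut the fragment off before the URL goes into the Set. This is a minimal sketch that covers only the fragment and is not a full URL normalisation; the class name FragmentStripper and the example.com URLs are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: removes the "#fragment" part of an absolute URL string.
class FragmentStripper {
    static String stripFragment(String absUrl) {
        int hash = absUrl.indexOf('#'); // the first '#' starts the fragment
        return hash < 0 ? absUrl : absUrl.substring(0, hash);
    }

    public static void main(String[] args) {
        Set<String> addedPages = new HashSet<>();
        // Both calls refer to the same page, so only the first add() succeeds
        System.out.println(addedPages.add(stripFragment("http://example.com/page.html")));
        System.out.println(addedPages.add(stripFragment("http://example.com/page.html#top")));
    }
}
```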
