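Get the sub-links of a URL using jsoup

Consider a URL, www.example.com. It may have a large number of links; some may be internal and others external. I want to get a list of all the sub-links: not the sub-sub-links, only the sub-links. For example, suppose there are four links as follows: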

1)www.example.com/images/main 
2)www.example.com/data 
3)www.example.com/users 
4)www.example.com/admin/data

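Out of these four, only 2 and 3 are of use, because they are sub-links and not sub-sub-links. Is there a way to achieve this with jsoup? If it cannot be achieved with jsoup, please point me to some other Java API. Also note that each result should be a link under the parent URL that was initially submitted (i.e. www.example.com).

Source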

2017-03-27 java fan

If I understand correctly that a sub-link contains exactly one slash, you can try counting the number of slashes, for example:

List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for (String link : list) {
    if ((link.length() - link.replaceAll("[/]", "").length()) == 1) {
        System.out.println(link);
    }
}

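link.length(): the number of characters in the link
link.replaceAll("[/]", "").length(): the number of characters once the slashes are removed

The difference between the two is the number of slashes. If it equals 1, the link is a sub-link; otherwise it is not. This assumes links written without a scheme, since "http://" adds two slashes of its own.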
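The slash count can also be expressed in terms of path segments, which copes with links that carry a scheme or a trailing slash. The following is only a sketch under some assumptions: the class and method names (SubLinkFilter, toUri, isDirectChild, directChildren) are made up for illustration, the parent is taken to be the site root, "http://" is prepended to links that have no scheme, and List.of needs Java 9 or later.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

final class SubLinkFilter {

    // Parses a link, prepending "http://" when it has no scheme (as in the question's examples).
    static URI toUri(String link) throws URISyntaxException {
        return new URI(link.contains("://") ? link : "http://" + link);
    }

    // True when the link is on the parent's host and its path has exactly one segment.
    static boolean isDirectChild(String parent, String link) {
        try {
            URI p = toUri(parent);
            URI l = toUri(link);
            if (l.getHost() == null || !l.getHost().equalsIgnoreCase(p.getHost())) {
                return false;
            }
            String path = l.getPath() == null ? "" : l.getPath();
            int segments = 0;
            for (String part : path.split("/")) {
                if (!part.isEmpty()) {
                    segments++;
                }
            }
            return segments == 1;
        } catch (URISyntaxException e) {
            return false; // links that cannot be parsed are skipped
        }
    }

    // Keeps only the links that are direct children of the parent.
    static List<String> directChildren(String parent, List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            if (isDirectChild(parent, link)) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "www.example.com/images/main",
            "www.example.com/data",
            "www.example.com/users",
            "www.example.com/admin/data");
        // prints [www.example.com/data, www.example.com/users]
        System.out.println(directChildren("www.example.com", links));
    }
}
```

Comparing hosts as well as counting segments also covers the requirement that a result must belong to the parent URL that was submitted.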

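Edit

How will I scan the whole website for sub-links?

The answer has to do with the robots.txt file, or Robots exclusion standard, which lists paths of the website, for example https://stackoverflow.com/robots.txt. The idea is to read this file and extract the sub-links from it. Keep in mind that robots.txt only lists the paths that crawlers are asked to avoid or allowed to visit, so it is not guaranteed to cover every sub-link. Here is a piece of code that can help you: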

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class Main {

    public static void main(String[] args) throws Exception {

        // Your web site
        String website = "https://stackoverflow.com";
        // We will read the URL https://stackoverflow.com/robots.txt
        URL url = new URL(website + "/robots.txt");

        // List of your sub-links
        List<String> list = new ArrayList<>();

        // Read the file with a BufferedReader
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String subLink;

            // Loop through the file line by line
            while ((subLink = in.readLine()) != null) {

                // Keep rules with a single path segment ending in "/" or "?"
                if (subLink.matches("Disallow: /\\w+[/?]")) {
                    list.add(website + "/" + subLink.replace("Disallow: /", ""));
                } else {
                    System.out.println("not match");
                }
            }
        }

        // Print the result
        System.out.println(list);
    }
}

This will give you:

[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?, https://stackoverflow.com/search/, https://stackoverflow.com/search?, https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?, https://stackoverflow.com/unanswered/, https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/, https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/, https://stackoverflow.com/plugins/]

Here is a demo of the regex that I use.

Hope this helps.

Source
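As a variation on the regex approach, the Disallow rules can be pulled out by splitting each line into a field name and a value, which also tolerates comments and differences in case. This is a minimal sketch rather than a full robots.txt parser: the class name RobotsPaths and the sample lines are invented for illustration, and User-agent groups are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RobotsPaths {

    // Collects the value of every "Disallow" rule, skipping comments and empty values.
    static List<String> disallowedPaths(List<String> robotsLines) {
        List<String> paths = new ArrayList<>();
        for (String raw : robotsLines) {
            // Drop anything after a '#', which starts a comment in robots.txt
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("Disallow") && !value.isEmpty()) {
                paths.add(value);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Made-up sample; in practice the lines would come from the BufferedReader
        List<String> sample = Arrays.asList(
            "User-agent: *",
            "Disallow: /posts/",
            "Disallow: /search?   # comment",
            "Allow: /",
            "Disallow:");
        // prints [/posts/, /search?]
        System.out.println(disallowedPaths(sample));
    }
}
```

Each returned path can then be appended to the site address, and filtered further if only single-segment paths are wanted.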

2017-03-27 11:52:58

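But how will I scan the whole website for sub-links?

Your implementation will work once I have all the internal links on the website.

Check my edit @javafan. The idea is to read robots.txt, which contains information about the website, so you can extract the sub-links from there.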

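To scan the links on a web page, you can use the JSoup library. The code below collects them into a list, which can then be filtered as the previous answer suggests.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class read_data {

    public static void main(String[] args) {
        try {
            // Replace your_url with the address of the page to scan
            Document doc = Jsoup.connect("your_url").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                // abs:href resolves relative links against the page URL
                list.add(link.attr("abs:href"));
            }
            System.out.println(list);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}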

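The code to read all the links on a website is shown below. I have used http://stackoverflow.com/ for illustration. I suggest that you go through the company's terms of use before scraping the website.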

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class readAllLinks {

    // Every URL seen so far, so that no page is visited twice
    public static Set<String> uniqueURL = new HashSet<String>();
    public static String my_site;

    public static void main(String[] args) {
        readAllLinks obj = new readAllLinks();
        my_site = "stackoverflow.com";
        obj.get_links("http://stackoverflow.com/");
    }

    private void get_links(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                boolean add = uniqueURL.add(this_url);
                // Recurse only into new URLs that belong to the site
                if (add && this_url.contains(my_site)) {
                    System.out.println(this_url);
                    get_links(this_url);
                }
            });
        } catch (IOException ex) {
            // Pages that cannot be fetched are skipped
        }
    }
}

You will get the list of all the links in the uniqueURL field.
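On a large site the recursion above can run for a very long time, so it may help to cap the number of pages visited. The sketch below shows one way to do that with a queue instead of recursion. It is an illustration under assumptions, not the code from the answer: the names BoundedCrawler, crawl, siteMarker, maxPages and fetchLinks are hypothetical, the link fetcher is passed in as a function so that the example runs without network access, and unlike uniqueURL it records only links that belong to the site. With jsoup, the fetcher would return the abs:href values of a page.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class BoundedCrawler {

    // Breadth-first crawl that fetches at most maxPages pages and
    // records every discovered link whose URL contains siteMarker.
    static Set<String> crawl(String start, String siteMarker, int maxPages,
                             Function<String, List<String>> fetchLinks) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        int visited = 0;
        while (!queue.isEmpty() && visited < maxPages) {
            String url = queue.poll();
            visited++;
            for (String link : fetchLinks.apply(url)) {
                // seen.add returns false for links that were already recorded
                if (link.contains(siteMarker) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // In-memory stand-in for real pages (made-up URLs)
        Map<String, List<String>> pages = Map.of(
            "http://site.test/", List.of("http://site.test/a", "http://other.test/x"),
            "http://site.test/a", List.of("http://site.test/", "http://site.test/b"));
        Set<String> found = crawl("http://site.test/", "site.test", 10,
            url -> pages.getOrDefault(url, Collections.emptyList()));
        // prints [http://site.test/, http://site.test/a, http://site.test/b]
        System.out.println(found);
    }
}
```

The cap bounds the number of pages fetched, not the number of links recorded, so the returned set can be larger than maxPages.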

Source

2017-03-28 11:03:39

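Thanks for your help, but let me tell you that I do not simply want the links on one web page; I want the links of the whole website.

You can see [this](http://stackoverflow.com/questions/32299871/java-get-every-webpage-associated-with-domain-name-programmatically). Let me know if this does not work for you.

The answer I accepted is the same as well.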
