Extracting https URLs with jsoup

I have the following code, which uses jsoup to extract the URLs from a given web page.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        String url = "http://shopping.yahoo.com";
        print("Fetching %s...", url);

        // Fetch and parse the page, then collect every <a> element.
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.getElementsByTag("a");

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            // absUrl resolves relative hrefs against the page URL; attr("href") would return the raw value.
            print(" * a: <%s> (%s)", link.absUrl("href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shortens s to at most width characters, ending with "." when it was cut.
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}

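What I want to do is build a crawler that collects only https sites. I give the crawler a seed link to start from; it should extract all the https links on that page, then visit each extracted link and do the same there, until a certain number of URLs has been collected.

My question: the code above extracts every link on a given page. I need to extract only the links that start with https://. What do I need to change to achieve this?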

Source

2012-07-05 Jury A

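Some sites automatically redirect users to the HTTPS site when they arrive over HTTP. Do you need those links as well? (That case is harder, because you would have to issue an HTTP request to find out.) – nhahtdh

Thanks. No, I only want to collect https sites from the Internet. –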

You can use jsoup's selectors. They are very powerful.

doc.select("a[href*=https]"); // (this is the one you are looking for) matches if the href value contains https
doc.select("a[href^=www]");   // matches if the href value starts with www
doc.select("a[href$=.com]");  // matches if the href value ends with .com

And so on. Experiment with them and you will find the right one.
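
One caveat worth testing: the contains form (`*=`) also matches an href such as http://example.com/?next=https, so the prefix form `a[href^=https://]` may be closer to "starts with https://". Relative links have no scheme at all, so resolving them first with `link.absUrl("href")` and then checking the scheme is another option. Below is a minimal sketch of that check combined with the seed-and-limit loop described in the question. It uses only the JDK so it can be run without network access; the class name `HttpsCrawlSketch`, the methods `isHttps` and `collect`, and the `linksOf` parameter are hypothetical names for illustration, not part of jsoup.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of jsoup. */
class HttpsCrawlSketch {

    /** Returns true only when the URL's scheme is https (case-insensitive). */
    static boolean isHttps(String url) {
        try {
            String scheme = new URI(url.trim()).getScheme();
            return scheme != null && scheme.equalsIgnoreCase("https");
        } catch (Exception e) {
            return false; // null, empty, or malformed URL: skip it
        }
    }

    /**
     * Breadth-first collection of https URLs starting from a seed page.
     * linksOf maps a page URL to the absolute URLs found on that page.
     * Assumption: with jsoup it could be built by calling absUrl("href")
     * on each element of doc.select("a[href]").
     */
    static List<String> collect(String seed, int limit,
                                Function<String, List<String>> linksOf) {
        Set<String> collected = new LinkedHashSet<>(); // https URLs in discovery order
        Set<String> queued = new HashSet<>();          // pages already scheduled for a visit
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        queued.add(seed);

        while (!queue.isEmpty() && collected.size() < limit) {
            String page = queue.poll();
            for (String link : linksOf.apply(page)) {
                if (collected.size() >= limit) {
                    break; // stop as soon as enough URLs are collected
                }
                if (!isHttps(link)) {
                    continue; // drop http, mailto, javascript, and broken links
                }
                collected.add(link);
                if (queued.add(link)) {
                    queue.add(link); // visit each https page at most once
                }
            }
        }
        return new ArrayList<>(collected);
    }
}
```

A call would look like `HttpsCrawlSketch.collect("http://shopping.yahoo.com", 100, fetchLinks)`, where `fetchLinks` is whatever function you write to download a page and return its links. Keeping the download step behind a function makes the loop easy to try out against an in-memory map of pages before pointing it at real sites.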

Source

2012-07-05 05:30:24
