2016-03-11 93 views

How to crawl multiple URLs with jsoup

I have the following code that uses jsoup to crawl a website, but I want to crawl several URLs at the same time. I store the URLs in an array, but I can't get it to work. How would this code be implemented with multithreading, and is multithreading a good fit for this kind of application?

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Webcrawler {
    public static void main(String[] args) throws IOException {

        String[] urls = {"http://www.dmoz.org/", "https://docs.oracle.com/en/"};

        System.out.println("Sites to be crawled:");
        for (String url : urls) {
            System.out.println("  " + url);
        }

        // Currently only the first URL is actually fetched
        print("%nFetching %s...", urls[0]);

        Document doc = Jsoup.connect(urls[0]).get();
        Elements links = doc.select("a");
        // doc.select("a[href*=https]") would select only links whose href contains "https"
        print("%nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" %s (%s)", link.absUrl("href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        return s;
    }
}

I found this very helpful. Please take some time to go through it.... http://mrbool.com/how-to-create-a-web-crawler-and-storing-data-using-java/28925 – SmashCode


"I can't get it to work" is not a problem description. What *exactly* is your problem? – Raedwald


@Raedwald, my problem is that I want to create a web crawler that can crawl several websites at the same time. Right now it only crawls a single website/URL. I want to store the URLs in an array and crawl all of them. –

Answer


You can use multithreading to crawl several websites at the same time. The following code does what you want. I'm quite sure it can be improved a lot (for example by using an Executor), but I wrote it down quickly.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) { 

     String[] urls = new String[]{"http://www.dmoz.org/", "http://www.dmoz.org/Computers/Computer_Science/", "https://docs.oracle.com/en/"}; 

     // Create and start workers 
     List<Worker> workers = new ArrayList<>(urls.length); 
     for (String url : urls) { 
      Worker w = new Worker(url); 
      workers.add(w); 
      new Thread(w).start(); 
     } 

     // Retrieve results 
     for (Worker w : workers) { 
      Elements results = w.waitForResults(); 
      if (results != null) 
       System.out.println(w.getName()+": "+results.size()); 
      else 
       System.err.println(w.getName()+" had some error!"); 
     } 
    } 
} 

class Worker implements Runnable {

    private final String url;
    private final String name;
    private Elements results;
    private boolean failed;
    private static int number = 0;

    private final Object lock = new Object();

    public Worker(String url) {
        this.url = url;
        this.name = "Worker-" + (number++);
    }

    public String getName() {
        return name;
    }

    @Override
    public void run() {
        try {
            Document doc = Jsoup.connect(this.url).get();

            Elements links = doc.select("a");

            // Publish the results and wake the waiting thread
            synchronized (lock) {
                this.results = links;
                lock.notifyAll();
            }
        } catch (IOException e) {
            // You should implement better error handling here..
            System.err.println("Error while parsing: " + this.url);
            e.printStackTrace();
            // Signal failure too, otherwise waitForResults() would block forever
            synchronized (lock) {
                this.failed = true;
                lock.notifyAll();
            }
        }
    }

    public Elements waitForResults() {
        synchronized (lock) {
            try {
                while (this.results == null && !this.failed) {
                    lock.wait();
                }
                return this.results; // null if the worker failed
            } catch (InterruptedException e) {
                // Again, better error handling is needed
                e.printStackTrace();
            }

            return null;
        }
    }
}

Could be rewritten to use ReentrantLock. (See: http://stackoverflow.com/a/11821900/363573) – Stephan
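For reference, the synchronized/notifyAll pair in the answer maps onto ReentrantLock plus a Condition like this. This is a sketch of just the synchronization part (the publish method and the LockWorker class name are hypothetical helpers, not part of the answer's code):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

import org.jsoup.select.Elements;

class LockWorker {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition done = lock.newCondition();
    private Elements results;

    // Called by the worker thread once crawling finishes
    void publish(Elements links) {
        lock.lock();
        try {
            this.results = links;
            done.signalAll();   // replaces lock.notifyAll()
        } finally {
            lock.unlock();
        }
    }

    // Called by the main thread; replaces the synchronized waitForResults()
    Elements waitForResults() throws InterruptedException {
        lock.lock();
        try {
            while (results == null) {
                done.await();   // replaces lock.wait()
            }
            return results;
        } finally {
            lock.unlock();
        }
    }
}
```

The explicit lock behaves like the intrinsic monitor here; the gain is mostly flexibility (multiple Conditions, tryLock, timed waits) rather than correctness.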


Right, thanks for the link ;) – user2340612


Thanks @user2340612 –