2011-07-13

This is the code from my MyCrawler.java. It crawls every link that matches the prefixes I pass to href.startsWith, but suppose I do not want to crawl one particular page, say http://inv.somehost.com/people/index.html. How can I do that in my code? In other words, how do I exclude certain URLs so that specific pages are not crawled?

public MyCrawler() { 
} 

public boolean shouldVisit(WebURL url) { 

    String href = url.getURL().toLowerCase(); 

    if (href.startsWith("http://www.somehost.com/") || href.startsWith("http://inv.somehost.com/") || href.startsWith("http://jo.somehost.com/")) { 
        // And if I do not want to crawl this page, http://inv.somehost.com/data/index.html, how can that be done? 
        return true; 
    } 
    return false; 
} 

public void visit(Page page) { 

    int docid = page.getWebURL().getDocid(); 
    String url = page.getWebURL().getURL(); 
    String text = page.getText(); 
    List<WebURL> links = page.getURLs(); 
    int parentDocid = page.getWebURL().getParentDocid(); 

    try { 
        URL url1 = new URL(url); 
        System.out.println("URL:- " + url1); 
        URLConnection connection = url1.openConnection(); 

        // Look for an HTML content type among the response headers 
        Map<String, List<String>> responseMap = connection.getHeaderFields(); 
        for (Map.Entry<String, List<String>> header : responseMap.entrySet()) { 
            String key = header.toString(); 

            if (key.contains("text/html") || key.contains("text/xhtml")) { 
                System.out.println(key); 
                // e.g. Content-Type=[text/html; charset=ISO-8859-1] 

                // Note: Pattern.matcher() never returns null, so this condition is always true; 
                // a matches() or find() check on the `filters` pattern was probably intended. 
                if (filters.matcher(key) != null) { 
                    System.out.println(url1); 
                    try { 
                        final File parentDir = new File("crawl_html"); 
                        parentDir.mkdir(); 
                        final String hash = MD5Util.md5Hex(url1.toString()); 
                        final String fileName = hash + ".txt"; 
                        final File file = new File(parentDir, fileName); 
                        boolean success = file.createNewFile(); // creates crawl_html/<hash>.txt if it does not exist 

                        System.out.println("hash:-" + hash); 
                        System.out.println(file); 

                        FileOutputStream fos = new FileOutputStream(file, true); 
                        PrintWriter out = new PrintWriter(fos); 

                        // Extract the page text with Tika and write it to the file 
                        Tika t = new Tika(); 
                        String content = t.parseToString(url1); 

                        out.println("==============================================================="); 
                        out.println(url1); 
                        out.println(key); 
                        out.println(content); 
                        out.println("==============================================================="); 
                        out.close(); // also flushes and closes the underlying FileOutputStream 
                    } catch (FileNotFoundException e) { 
                        e.printStackTrace(); 
                    } catch (IOException e) { 
                        e.printStackTrace(); 
                    } catch (TikaException e) { 
                        e.printStackTrace(); 
                    } 
                } 
            } 
        } 
    } catch (MalformedURLException e) { 
        e.printStackTrace(); 
    } catch (IOException e) { 
        e.printStackTrace(); 
    } 

    System.out.println("============="); 
} 

And this is my Controller.java, from which MyCrawler is invoked:

public class Controller { 
    public static void main(String[] args) throws Exception { 
        CrawlController controller = new CrawlController("/data/crawl/root"); 
        controller.addSeed("http://www.somehost.com/"); 
        controller.addSeed("http://inv.somehost.com/"); 
        controller.addSeed("http://jo.somehost.com/"); 
        // Configure the crawl before starting it; start() blocks until the crawl 
        // finishes, so settings applied afterwards have no effect. 
        controller.setPolitenessDelay(200); 
        controller.setMaximumCrawlDepth(2); 
        controller.start(MyCrawler.class, 20); 
    } 
} 

Any suggestions would be appreciated.

Answer


How about adding a property that tells the crawler which URLs you want to exclude?

Add all the pages you do not want it to crawl to your exclusion list.

Here is an example:

public class MyCrawler extends WebCrawler { 

    List<Pattern> exclusionsPatterns; 

    public MyCrawler() { 
        exclusionsPatterns = new ArrayList<Pattern>(); 
        // Add all your exclusions here, as regular expressions 
        exclusionsPatterns.add(Pattern.compile("http://investor\\.somehost\\.com.*")); 
    } 

    /* 
     * You should implement this function to specify 
     * whether the given URL should be visited or not. 
     */ 
    public boolean shouldVisit(WebURL url) { 
        String href = url.getURL().toLowerCase(); 

        // Iterate over the patterns to find out whether the URL is excluded. 
        for (Pattern exclusionPattern : exclusionsPatterns) { 
            Matcher matcher = exclusionPattern.matcher(href); 
            if (matcher.matches()) { 
                return false; 
            } 
        } 

        if (href.startsWith("http://www.ics.uci.edu/")) { 
            return true; 
        } 
        return false; 
    } 
} 

In this example, we are telling the crawler that any URL starting with http://investor.somehost.com should not be crawled.

So these will not be crawled:

http://investor.somehost.com/index.html 
http://investor.somehost.com/something/else 

I suggest you read up on regular expressions.
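
Applied to the hosts from the question, the same idea would look roughly like the constructor below. This is only a sketch; the exact patterns are assumptions based on the URLs mentioned in the question, and the rest of shouldVisit would keep the somehost.com startsWith checks from the question rather than the http://www.ics.uci.edu/ one used above.

public MyCrawler() { 
    exclusionsPatterns = new ArrayList<Pattern>(); 
    // Exclude just the single page mentioned in the question 
    exclusionsPatterns.add(Pattern.compile("http://inv\\.somehost\\.com/people/index\\.html")); 
    // Or exclude everything under /people/ on that host 
    exclusionsPatterns.add(Pattern.compile("http://inv\\.somehost\\.com/people/.*")); 
} 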


So how exactly do we do that? That is what I am asking. Any ideas? – ferhan


You can do this by keeping a List of exclusions and adding the URLs you want to exclude to it. You then check that list to decide whether the page should be processed. –


Any example using my code would be appreciated, and where it should go in my MyCrawler.java file. – ferhan
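
Following up on the comment above about keeping a list of exclusions: a minimal sketch of that idea inside MyCrawler.java, using plain URL strings instead of regular expressions (the field name and the listed URLs are illustrative assumptions, not part of the thread), could look like this:

public class MyCrawler extends WebCrawler { 

    // Exact URLs that should never be visited (illustrative list; needs java.util.Arrays and java.util.List) 
    private final List<String> excludedUrls = Arrays.asList( 
            "http://inv.somehost.com/people/index.html", 
            "http://inv.somehost.com/data/index.html"); 

    public boolean shouldVisit(WebURL url) { 
        String href = url.getURL().toLowerCase(); 

        // Check the exclusion list before the usual host checks 
        if (excludedUrls.contains(href)) { 
            return false; 
        } 

        return href.startsWith("http://www.somehost.com/") 
                || href.startsWith("http://inv.somehost.com/") 
                || href.startsWith("http://jo.somehost.com/"); 
    } 
} 

The visit method from the question would stay unchanged.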