2013-10-12 70 views
4

I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure I respect the robots.txt rules and only visit pages that are allowed. How can I parse robots.txt in Java and determine whether a URL is allowed?

I am fairly sure jsoup is not made for this; it is about web scraping and parsing. So I plan to write a function/module that reads the robots.txt of a domain/site and determines whether the URL I am about to visit is allowed.

I did some research and found the following, but I am not sure about them. If anyone has done a similar project involving robots.txt parsing, please share your thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

+0

What exactly is the question? Parsing robots.txt seems to be out of scope for Jsoup. Jsoup parses web pages, as you said yourself. – Darwind

+0

Thanks, yes, I use jsoup to parse the pages... but the requirement is to parse only the URLs that are allowed (not restricted) by robots.txt, and Jsoup does not seem to be the right tool for that validation, or cannot do it. So what I need to know is how I can perform this robots.txt validation before doing the actual parsing. –

+0

OK, that's fine. I was looking for a small project using jsoup, so I may do it myself. – alkis

Answers

6

A late answer, just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example of the code I use:

// Classes used below come from crawler-commons (crawlercommons.robots.*),
// Apache HttpClient 4.2.1 (org.apache.http.*) and Commons IO (org.apache.commons.io.IOUtils).
String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Build a host identifier such as "http://example.com" or "http://example.com:8080"
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// Cache the parsed rules per host so robots.txt is fetched only once per host
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
HttpClient httpclient = new DefaultHttpClient();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // No robots.txt present: treat everything as allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

Obviously this is not related to Jsoup in any way; it only checks whether a given URL may be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net as well.
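As a minimal sketch of how this check could sit in front of a Jsoup fetch (assuming the rules, url and USER_AGENT variables from the snippet above; the Jsoup connect/userAgent/get calls are the standard Jsoup API):

// Fetch and parse the page with Jsoup only if robots.txt permits it.
if (rules.isAllowed(url)) {
    Document doc = Jsoup.connect(url)
            .userAgent(USER_AGENT) // identify the bot consistently with the robots.txt check
            .get();
    // ... analyse the document as usual ...
} else {
    System.out.println("Skipping " + url + " (disallowed by robots.txt)");
}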

Please note that this code only checks whether a URL is allowed or disallowed and does not consider other robots.txt features such as "Crawl-delay". But since crawler-commons provides this as well, it can easily be added to the code above.
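A hedged sketch of what that could look like (the UNSET_CRAWL_DELAY constant and the millisecond unit are taken from the crawler-commons sources and should be verified against the version you use):

// BaseRobotRules exposes the Crawl-delay value (in milliseconds) if one was declared.
long crawlDelay = rules.getCrawlDelay();
if (crawlDelay != BaseRobotRules.UNSET_CRAWL_DELAY && crawlDelay > 0) {
    try {
        Thread.sleep(crawlDelay); // pause between requests to the same host
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}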

1

The above did not work for me, so I managed to put this together. It is the first Java I have written in four years, so I am sure it can be improved.

// The DISALLOW constant is not shown in the original; it is presumably something like:
private static final String DISALLOW = "Disallow:";

public static boolean robotSafe(URL url)
{
    String strHost = url.getHost();

    // robots.txt always lives at the root of the host
    String strRobot = "http://" + strHost + "/robots.txt";
    URL urlRobot;
    try {
        urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        // something weird is happening, so don't trust it
        return false;
    }

    String strCommands = "";
    try
    {
        InputStream urlRobotStream = urlRobot.openStream();
        byte[] b = new byte[1000];
        int numRead;
        // read the whole robots.txt file into one string
        while ((numRead = urlRobotStream.read(b)) != -1)
        {
            strCommands += new String(b, 0, numRead);
        }
        urlRobotStream.close();
    }
    catch (IOException e)
    {
        return true; // if there is no robots.txt file, it is OK to crawl
    }

    if (strCommands.contains(DISALLOW)) // if there are no "Disallow" lines, nothing is blocked
    {
        String[] split = strCommands.split("\n");
        ArrayList<RobotRule> robotRules = new ArrayList<>();
        String mostRecentUserAgent = null;
        for (int i = 0; i < split.length; i++)
        {
            String line = split[i].trim();
            if (line.toLowerCase().startsWith("user-agent"))
            {
                // remember which user-agent the following Disallow lines belong to
                int start = line.indexOf(":") + 1;
                int end = line.length();
                mostRecentUserAgent = line.substring(start, end).trim();
            }
            else if (line.startsWith(DISALLOW)) {
                if (mostRecentUserAgent != null) {
                    RobotRule r = new RobotRule();
                    r.userAgent = mostRecentUserAgent;
                    int start = line.indexOf(":") + 1;
                    int end = line.length();
                    r.rule = line.substring(start, end).trim();
                    robotRules.add(r);
                }
            }
        }

        for (RobotRule robotRule : robotRules)
        {
            String path = url.getPath();
            if (robotRule.rule.length() == 0) return true;   // an empty Disallow allows everything
            if (robotRule.rule.equals("/")) return false;    // "Disallow: /" blocks everything

            // block the URL if its path starts with the disallowed prefix
            if (robotRule.rule.length() <= path.length())
            {
                String pathCompare = path.substring(0, robotRule.rule.length());
                if (pathCompare.equals(robotRule.rule)) return false;
            }
        }
    }
    return true;
}

And you will need this helper class:

/** 
* 
* @author Namhost.com 
*/ 
public class RobotRule 
{ 
    public String userAgent; 
    public String rule; 

    RobotRule() { 

    } 

    @Override public String toString() 
    { 
     StringBuilder result = new StringBuilder(); 
     String NEW_LINE = System.getProperty("line.separator"); 
     result.append(this.getClass().getName() + " Object {" + NEW_LINE); 
     result.append(" userAgent: " + this.userAgent + NEW_LINE); 
     result.append(" rule: " + this.rule + NEW_LINE); 
     result.append("}"); 
     return result.toString(); 
    }  
}
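
A minimal usage sketch (the URL is purely illustrative, and the Jsoup call is the standard connect/get API):

URL target = new URL("http://www.example.com/some/page.html"); // illustrative URL
if (robotSafe(target)) {
    Document doc = Jsoup.connect(target.toString()).get(); // allowed: fetch and parse with Jsoup
    // ... work with doc ...
} else {
    System.out.println("Skipping " + target + " (disallowed by robots.txt)");
}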