A late answer, in case you (or someone else) are still looking for a way to do this. I am using crawler-commons (https://code.google.com/p/crawler-commons/) version 0.2 and it seems to work well. Here is a simplified example based on the code I use:
String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Normalize to scheme://host[:port] so one robots.txt is cached per host
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// Keep this map alive across requests so each host's rules are fetched only once
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // No robots.txt present: everything is allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);
Obviously this is not related to Jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient 4.2.1, but this could be replaced by java.net classes as well.
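If you prefer to avoid the HttpClient dependency, a minimal java.net sketch could look like this (the class and method names here are my own invention): it fetches hostId + "/robots.txt" with HttpURLConnection and returns null on 404, so the caller can fall back to RobotRulesMode.ALLOW_ALL:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsFetcher {

    /**
     * Fetches hostId + "/robots.txt" and returns its raw bytes,
     * or null if the server answers 404 (no robots.txt present).
     */
    public static byte[] fetchRobotsTxt(String hostId) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
        conn.setRequestProperty("User-Agent", "WhateverBot");
        try {
            if (conn.getResponseCode() == 404) {
                return null; // caller should treat this as "allow all"
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
            }
            return buf.toByteArray();
        } finally {
            conn.disconnect();
        }
    }
}
```

The returned byte[] can then be handed straight to SimpleRobotRulesParser.parseContent(), exactly as in the snippet above.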
Please note that this code only checks allowance or disallowance and does not consider other robots.txt features such as "Crawl-delay". Since crawler-commons provides that feature as well, it can easily be added to the code above.
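Honoring Crawl-delay could, for example, look roughly like the fragment below. This is a sketch under the assumption that BaseRobotRules in crawler-commons 0.2 exposes the parsed delay via getCrawlDelay() (in milliseconds) and uses BaseRobotRules.UNSET_CRAWL_DELAY to signal that no delay was specified; please double-check against the version you actually use:

```java
// After 'rules' has been obtained as in the snippet above:
long crawlDelay = rules.getCrawlDelay();
if (crawlDelay != BaseRobotRules.UNSET_CRAWL_DELAY) {
    // Be polite: wait the requested time between requests to this host
    Thread.sleep(crawlDelay);
}
```

In a real crawler you would track the last fetch time per host instead of sleeping unconditionally, but the idea is the same.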