The simplest way to do this is to parse the input twice, once as XML and once as HTML, and compare the element counts of the two results. The XML parser does not add elements on its own, whereas the HTML parser inserts missing optional tags (such as html, head, and body) and performs other normalization.
Here's an example:
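// Assumes JUnit 4 and jsoup on the classpath; the class name is only illustrative.
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.junit.Test;

public class AutoElementsTest {
    @Test
    public void detectAutoElements() {
        String bare = "<script>One</script>";
        String full =
            "<html><head><title>Check</title></head><body><p>One</p></body></html>";
        assertTrue(didAddElements(bare));
        assertFalse(didAddElements(full));
    }

    private boolean didAddElements(String input) {
        // Two passes, one as XML and one as HTML. XML does not vivify missing/optional tags.
        Document html = Jsoup.parse(input);
        Document xml = Jsoup.parse(input, "", Parser.xmlParser());
        // getAllElements() includes the document root itself in both counts.
        int htmlElementCount = html.getAllElements().size();
        int xmlElementCount = xml.getAllElements().size();
        boolean added = htmlElementCount > xmlElementCount;
        System.out.printf(
            "Original input has %s elements; HTML doc has %s. Is a fragment? %s%n",
            xmlElementCount, htmlElementCount, added);
        return added;
    }
}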
This gives the result:
Original input has 2 elements; HTML doc has 5. Is a fragment? true
Original input has 6 elements; HTML doc has 6. Is a fragment? false
Depending on your needs, you could extend this to compare the two document structures more deeply.
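As one possible direction for a deeper comparison, here is a minimal sketch that reports which tag names the HTML parser introduced, rather than only whether the counts differ. The class name AddedElements and the helper addedTags are hypothetical names made up for this illustration, and matching by lower-cased tag name is a simplifying assumption; it ignores element position and attributes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AddedElements {

    // Hypothetical helper: returns the tag names that appear in the HTML parse
    // but have no counterpart in the XML parse, in document order.
    static List<String> addedTags(String input) {
        Document html = Jsoup.parse(input);
        Document xml = Jsoup.parse(input, "", Parser.xmlParser());

        // Tag names as written in the input (the XML parser adds nothing).
        // Lower-cased because the XML parser preserves case and the HTML parser does not.
        List<String> remaining = new ArrayList<>();
        for (Element el : xml.getAllElements()) {
            if (el instanceof Document) continue; // skip the document root
            remaining.add(el.tagName().toLowerCase(Locale.ROOT));
        }

        // Each HTML element "uses up" one matching name from the XML parse;
        // anything left unmatched was introduced by the HTML parser.
        List<String> added = new ArrayList<>();
        for (Element el : html.getAllElements()) {
            if (el instanceof Document) continue;
            String name = el.tagName().toLowerCase(Locale.ROOT);
            if (!remaining.remove(name)) {
                added.add(name);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        System.out.println(addedTags("<script>One</script>"));
        System.out.println(addedTags(
            "<html><head><title>Check</title></head><body><p>One</p></body></html>"));
    }
}
```

For the bare input from the example above this should report html, head, and body as added, and an empty list for the full document. Because it only matches names, it would not notice elements that the HTML parser moved or re-nested; for that you would need to walk both trees in parallel.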