Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


jsoup HTML fragment detection

I'm parsing an HTML fragment with the jsoup HTML parser, without knowing in advance whether the input is a fragment or a full document. For example:

    String html = "<script>document.location = \"http://example.com/\";</script>";
    Document document = Jsoup.parse(html);
    System.out.println(document.html());

Output:

    <html>
     <head>
      <script>document.location = "http://example.com/";</script>
     </head>
     <body></body>
    </html>

Question: Is there a way to know that the <html>, <head> and <body> tags were added by jsoup and were not in the original HTML fragment?

Update:

I also tried enabling error tracking:

    Parser parser = Parser.htmlParser();
    parser.setTrackErrors(500);
    Document document = parser.parseInput(html, "example.com");
    ParseErrorList errors = parser.getErrors();

But I get an empty list of errors.



1 Answer


The simplest way to do this would be to parse it both as XML and as HTML, and compare the element counts of both results. The XML parser does not automatically add elements, whereas the HTML parser automatically adds missing optional tags and performs other normalization.

Here's an example:

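    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.parser.Parser;
    import org.junit.Test;                       // assumes JUnit 4
    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    // The class name is only a placeholder so the snippet compiles on its own.
    public class FragmentDetectionTest {

        @Test public void detectAutoElements() {
            String bare = "<script>One</script>";
            String full =
               "<html><head><title>Check</title></head><body><p>One</p></body></html>";

            assertTrue(didAddElements(bare));
            assertFalse(didAddElements(full));
        }

        private boolean didAddElements(String input) {
            // two passes, one as XML and one as HTML. XML does not vivify missing/optional tags
            Document html = Jsoup.parse(input);
            Document xml = Jsoup.parse(input, "", Parser.xmlParser());

            // getAllElements() includes the Document root itself in both counts
            int htmlElementCount = html.getAllElements().size();
            int xmlElementCount = xml.getAllElements().size();
            boolean added = htmlElementCount > xmlElementCount;

            System.out.printf(
              "Original input has %s elements; HTML doc has %s. Is a fragment? %s\n",
              xmlElementCount, htmlElementCount, added);

            return added;
        }
    }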

This gives the result:

Original input has 2 elements; HTML doc has 5. Is a fragment? true
Original input has 6 elements; HTML doc has 6. Is a fragment? false

Depending on your need, you could potentially extend this to more deeply compare the two document structures.
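
One possible extension, sketched here rather than taken from the answer above: instead of only comparing counts, report which of the structural tags (<html>, <head>, <body>) show up in the HTML parse but not in the XML parse of the same input. The class and method names (`SynthesizedTags`, `addedStructuralTags`) are made up for illustration, and the sketch assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this example.
final class SynthesizedTags {

    /**
     * Returns the structural tag names that are present after an HTML parse
     * but absent from an XML parse of the same input, i.e. tags the HTML
     * parser most likely created itself.
     */
    static List<String> addedStructuralTags(String input) {
        Document html = Jsoup.parse(input);
        Document xml = Jsoup.parse(input, "", Parser.xmlParser());

        List<String> added = new ArrayList<>();
        for (String tag : new String[] {"html", "head", "body"}) {
            boolean inHtml = !html.getElementsByTag(tag).isEmpty();
            boolean inXml = !xml.getElementsByTag(tag).isEmpty();
            if (inHtml && !inXml) {
                added.add(tag);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        // prints [html, head, body]
        System.out.println(addedStructuralTags("<script>One</script>"));
        // prints []
        System.out.println(addedStructuralTags(
            "<html><head><title>Check</title></head><body><p>One</p></body></html>"));
    }
}
```

This only looks at three tag names, so it would not notice other elements the HTML parser may insert (for example inside tables); treat it as a starting point rather than a complete structural diff.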

