Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


php - DOM parser that allows HTML5-style </ in <script> tag

Update: html5lib (bottom of question) seems to get close; I just need to improve my understanding of how it's used.

I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:

<script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
</script>

Most parsers stop too early because HTML 4.01 ends the content of a <script> element at the first ETAGO (</) followed by a letter. HTML5, however, allows </ inside the script and only ends it at </script>. All of the parsers I have tried so far have either failed or are so poorly documented that I couldn't tell whether they work.
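
To make the difference between the two rules concrete, here is a rough sketch. The function names are made up for illustration, the regexes only approximate each rule (the HTML5 one ignores the comment-like escape states), and this is not meant as a parser, only as a picture of where each rule stops reading.

```php
<?php

// Hypothetical helpers: offset at which the script content ends, or -1.

// HTML 4.01 rule: content ends at the first "</" followed by a letter.
function scriptEndHtml4($content)
{
    return preg_match('~</[a-zA-Z]~', $content, $m, PREG_OFFSET_CAPTURE) ? $m[0][1] : -1;
}

// HTML5 rule (simplified): content ends only at "</script" followed by
// whitespace, "/" or ">", matched case-insensitively.
function scriptEndHtml5($content)
{
    return preg_match('~</script[\s/>]~i', $content, $m, PREG_OFFSET_CAPTURE) ? $m[0][1] : -1;
}

// Everything that follows the opening <script id="foo"> tag.
$afterOpenTag = '<td>bar</td></script>';

echo substr($afterOpenTag, 0, scriptEndHtml4($afterOpenTag)), "\n"; // <td>bar
echo substr($afterOpenTag, 0, scriptEndHtml5($afterOpenTag)), "\n"; // <td>bar</td>
```

The first line of output matches the truncated content the failing parsers produce; the second is what I actually want back.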

My requirements:

  1. Real parser, not regex hacks.
  2. Ability to load full pages or HTML fragments.
  3. Ability to pull script contents back out, selecting by the tag's id attribute.

Input:

<script id="foo"><td>bar</td></script>

Example of failing output (no closing </td>):

<script id="foo"><td>bar</script>

Some parsers and their results:


DOMDocument (fails)

Source:

<?php

header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

Output:

Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script id="foo"><td>bar</script></head></html>


FluentDOM (fails)

Source:

<?php

header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
echo FluentDOM($html, 'text/html');

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><script id="foo"><td></script></body></html>


phpQuery (fails)

Source:

<?php

header('Content-type: text/plain');

require_once 'phpQuery.php';

phpQuery::newDocumentHTML(<<<EOF
<script type="text/x-jquery-tmpl" id="foo">
<td>test</td>
</script>
EOF
);

echo (string)pq('#foo');

Output:

<script type="text/x-jquery-tmpl" id="foo">
<td>test
</script>


html5lib (passes)

Possibly promising. Can I get at the contents of the script#foo tag?

Source:

<?php

header('Content-type: text/plain');

include 'HTML5/Parser.php';

$html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
$d = HTML5_Parser::parse($html);

echo $d->saveHTML();

Output:

<html><head></head><body><script id="foo"><td></td></script></body></html>
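
Since the result above responds to saveHTML(), it is presumably a DOMDocument. If so, something like the following sketch might pull the script contents back out. The helper name is made up, and the demo builds a stand-in document by hand rather than calling html5lib, on the assumption that the parser stores the script body as a plain text node.

```php
<?php

// Hypothetical helper (not part of html5lib): raw text of <script id="...">, or null.
function scriptTextById(DOMDocument $doc, $id)
{
    $xpath = new DOMXPath($doc);
    // local-name() matches whether or not the parser put the element in a namespace.
    // Assumes $id contains no double quotes.
    $nodes = $xpath->query('//*[local-name()="script" and @id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->textContent;
}

// Stand-in for the parser's output: a script element holding one text node.
$d = new DOMDocument;
$script = $d->createElement('script');
$script->setAttribute('id', 'foo');
$script->appendChild($d->createTextNode('<td>bar</td>'));
$d->appendChild($script);

echo scriptTextById($d, 'foo'), "\n"; // <td>bar</td>
```

XPath is used instead of getElementById() because the latter only works when the document knows which attribute is an ID, which may not be the case for a tree built this way.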

1 Answer


I had the same problem, and apparently you can hack your way through it by loading the document as XML and saving it as HTML :)

$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

But of course the markup must be well-formed XML for loadXML to work.
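
A minimal sketch of how that could be guarded, and how the template might be read back out. Both function names are made up for illustration. Note that with loadXML the <td> becomes a real child element rather than text, so the contents have to be re-serialized node by node.

```php
<?php

// Hypothetical helper: returns a DOMDocument, or null if the markup is not well-formed XML.
function loadFragmentAsXml($markup)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $ok = $doc->loadXML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $doc : null;
}

// Hypothetical helper: serialize the children of a node back to markup.
function innerXml(DOMNode $node)
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$doc = loadFragmentAsXml('<script id="foo"><td>bar</td></script>');
echo innerXml($doc->documentElement), "\n"; // <td>bar</td>

// Unclosed <td>: not well-formed, so the loader gives up.
var_dump(loadFragmentAsXml('<script id="foo"><td>bar</script>')); // NULL
```

One caveat: saveXML() may write empty elements in self-closing form, so the output is not guaranteed to be byte-for-byte identical to the original template.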

