Welcome To Ask or Share your Answers For Others

Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Welcome To Ask or Share your Answers For Others

1 Answer

answered Oct 16, 2021 by 深蓝 (71.8m points)

Here's some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (J?rg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd"> 
<HTML/
  <HEAD/
    <TITLE/>/
    <P/>

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

...