XHTML is easy, use lxml.
from lxml import etree
from StringIO import StringIO
etree.parse(StringIO(html), etree.HTMLParser(recover=False))
HTML is harder, since there's traditionally not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…