How to parse HTML/XML

(Or Any Arbitrarily Nested Data)

Summary

When faced with the task of parsing HTML (or XML and other similar grammars), many people immediately think of using the powerful text-processing capabilities of regular expressions to do the work for them. This is usually the wrong approach. HTML is a very 'loose' language to begin with, and over the years it has been increasingly abused by lazy programmers and novices who don't follow its specifications or grammar rules. This leaves us with a tremendous amount of non-conforming or outright broken HTML in regular use. Over the same years, parsers have evolved to cope with common problematic HTML, and they will happily parse even the most horrible pages for you with at least some degree of fidelity to the document's original intent.

With that said, regular expressions have not evolved over the years to deal with the voluminous amount of horrid HTML out there, nor would they have any reason to. They are for matching specific patterns, and they work well on things that have a known structure or format. They are inherently bad at making distinctions that a human (or a token parser) makes easily, such as (but not limited to) HTML nested in comments, overlapping tags, and HTML entities. They are also not good at focusing on a particular part of a document based on its relative structure. Most importantly, they are very bad at adapting to even small changes in the document itself.
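To make the point about HTML nested in comments concrete, here is a minimal sketch in Python using only the standard library. The sample markup, the `LinkCollector` class, and the naive pattern are all made up for illustration; they are not taken from any of the pages linked below.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one live link and one link that was commented out.
HTML = """
<p>See <a href="/real">the real link</a>.</p>
<!-- <a href="/old">a link that was commented out</a> -->
"""

# Naive regex: it has no notion of comments, so it reports both hrefs.
regex_links = re.findall(r'<a\s+href="([^"]*)"', HTML)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(HTML)
collector.close()
parser_links = collector.links

print(regex_links)   # ['/real', '/old']
print(parser_links)  # ['/real']
```

The parser treats the comment as a comment, so the dead link never reaches `handle_starttag`. The regex could be patched to skip comments, but each such patch handles one more special case that a parser already knows about.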

So without further ado, here is how you parse HTML documents:

DON'T use a Regular Expression (Regex, Regexp, RE)

DO use an HTML/XML Parser (examples)
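As a rough illustration of the parser approach on arbitrarily nested data, the sketch below pulls the text out of one particular `<div>`, however deeply other `<div>`s are nested inside it. It uses Python's standard `html.parser` module; the `ContentText` class, the `PAGE` sample, and the `content` id are assumptions invented for this example.

```python
from html.parser import HTMLParser


class ContentText(HTMLParser):
    """Collect text inside the <div> whose id equals target_id, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # open <div>s counted from the target; 0 means outside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1          # a div nested inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1          # reaches 0 when the target itself closes

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


# Hypothetical page: mixed case, single quotes, and nesting are deliberate.
PAGE = """
<div id="nav"><div>Home</div></div>
<DIV ID='content'>
  <div class="intro">Hello</div>
  <div><div>nested <b>deeply</b></div></div>
</DIV>
<div id="footer">Bye</div>
"""

parser = ContentText("content")
parser.feed(PAGE)
parser.close()
print(parser.chunks)  # ['Hello', 'nested', 'deeply']
```

The depth counter is what a single regular expression cannot express: knowing which `</div>` closes the target requires counting every `<div>` opened in between. The parser also normalizes tag and attribute names to lower case, so `<DIV ID='content'>` needs no special handling.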

When you can make some very strict guarantees about your data, it MIGHT be okay to parse it with a regular expression.

If...

If you cannot guarantee ALL of the above, DON'T DON'T DON'T use a regular expression.

Links

Further Discussion

Parsing HTML With Regexes A perlmonks thread in which #perlhelp's very own woggle discusses the topic at hand.
Bring Me Your Regexs! I Will Create HTML To Break Them! An article on how regexes break while parsing HTML.
Do Not... DO NOT! Parse HTML with Regex's Further reiteration for the logic impaired.

Parsers

HTML::Parser
HTML::TableExtract
HTML::TokeParser
HTML::LinkExtor
Various Perl HTML Parser modules.
XML::Parser
XML::SAX
XML::Simple
Various Perl XML Parser modules.
HTML Agility Pack
A .NET Parser that is tolerant of malformed (real-world) HTML
Python HTMLParser class
Python htmllib parsing module
Beautiful Soup and a Ruby port called Rubyful Soup (Thanks Ezio!)
HTML parsers for Python (Thanks Kenneth!)
Java HTMLParser Library
A parser for 'real world' HTML in Java.
The Regex Programming Wiki
Mark from The Regex Programming Wiki sent me a link to his site which has some great regex info as well as links to several HTML parsers in the FAQ section! Check it out!
Please note, I'm very interested in hearing of parser implementations that I'm missing or in languages not covered here. If you know of any, please send me a note to the address at the bottom of this page. If you find this page useful, I'd also appreciate hearing from you!

If you would also like a specific credit other than a 'thanks <your name>', please let me know!

<matt at icenine dot ca>