HTML Cleaners and Tidiers

Tag Soup

When you are crawling a website you will come across a lot of malformed web pages. Some typical problems are unclosed tags, mishandling of comments or of css styles. Modern browsers have to do a good job of cleaning HTML to build the correct DOM without ambiguities. Due to performance and scalability limitations, it is more efficient to process HTML with a parser instead of using a browser or headless browsers such as HTMLUnit or PhantomJS. If your HTML parser does not incorporate the cleaning or fixing process, you will have to use an HTML cleaner or tidier.

As in other processing pipelines if you fail to clean up malformed HTML, all subsequent processes will be stalled. It is important to choose a good HTML cleaner. Many cleaners fail to do their jobs.

HTML Cleaner List

The list of HTML cleaners is long, but the list of good ones is pretty short. In our experience the best choice is lxml.html. Other cleaners often have trouble.

Comprehensive Resources

  1. lxml.html
  2. Beautiful Soup
  3. lxml.html vs Beautiful Soup
  4. Cleaning Word’s Nasty HTML
  5. HTML Cleaners query
  6. Tag soup