Htmlys

Htmlys is a free library that allows developers to parse web content and react to HTML tokens (such as the ubiquitous tags) discovered while parsing. The library is distributed both as a dynamic extension for PHP and as a dynamic module for Python and is intended to extend the facilities offered by both programming languages.
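
To give a feel for what "reacting to HTML tokens" means, here is a minimal sketch of the callback style. It does not use the Htmlys API, which is described in the manuals. It uses Python's standard `html.parser` module as a stand-in, and the names `TokenLogger` and `tokenize` are made up for this illustration.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Records a (kind, value) pair for each token the parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag found in the input.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag found in the input.
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags; whitespace-only runs are skipped.
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tokenize(markup):
    """Feed markup to the stand-in parser and return the recorded tokens."""
    parser = TokenLogger()
    parser.feed(markup)
    parser.close()
    return parser.tokens
```

Calling `tokenize("<p>Hello <b>world</b></p>")` yields the start tag `p`, the text `Hello`, the start tag `b`, the text `world`, and then the two end tags, in document order.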

Although it is used from PHP and Python userland, the core of the library is compiled to native code, so it parses significantly faster than an equivalent written in PHP or Python, which would first have to be compiled to intermediate bytecode and then run by a virtual machine. As such, the tool lets PHP and Python developers focus on their problem and create fast, professional web spiders and other powerful scraping utilities with ease and flexibility.

Unlike the libxml extension bundled with PHP, a tool that is sometimes used to parse HTML content, Htmlys is a real native HTML 5 parser which could even power your web browser. Furthermore, while it follows the latest HTML 5 recommendation, it can also parse legacy HTML documents, because HTML 5 is backwards compatible with earlier versions of HTML. Generally speaking, the main differences between the two tools are:

- libxml is designed to parse well-formed XML documents; it may fail (or produce warnings) on HTML documents, which are often not well-formed. Warnings can be turned off but are dispatched by default.

- Htmlys is designed to parse HTML documents; it will not fail if the documents are not well-formed. Parse errors are detected but dispatched only if desired.

- libxml does not know about HTML semantics; it treats all elements equally, as required by the XML specification.

- Htmlys knows about HTML semantics; it treats elements according to the rules defined in the HTML 5 specification. <script>, for example, has its own parsing rules.

- Similarly, libxml does not know about HTML entities; it recognizes and maps only the ones that are defined in XML.

- Htmlys knows about HTML entities; it recognizes and maps all the entities defined in the HTML 5 specification.
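
The differences listed above can be observed with a small experiment. The sketch below uses neither Htmlys nor libxml. It assumes Python's standard library as a stand-in, with `xml.etree.ElementTree` playing the strict XML parser and `html.parser` playing the HTML-aware one. The helper names `xml_parse_ok` and `html_text` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical web markup: an HTML-only entity and an unclosed <br>.
SNIPPET = "<p>Caf&eacute;<br>open</p>"


def xml_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Gathers the text content reported by the HTML-aware parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(markup):
    """Return the concatenated text found in the markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

With these helpers, `xml_parse_ok(SNIPPET)` is false, because `&eacute;` is not defined in XML and `<br>` is never closed, while `html_text(SNIPPET)` returns the text with the entity mapped to `é`. The stand-in HTML parser also treats the content of `<script>` as raw text, so markup-like characters inside a script are not reported as tags.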

Please note that while we have tested it successfully on many existing web pages, the software is still experimental and may sometimes behave unexpectedly in untested conditions. We improve our test suites and the software itself whenever inconsistencies are found, and we look forward to releasing a stable version soon. If you find a bug, you can help us improve the library by sending us your feedback and, if possible, the document and the script that triggered the problem.

Please consult the manual for PHP or the manual for Python to learn how to use the library.