Htmlys manual for Python

Installation

There are several ways to install Htmlys for Python on your system. We cover the most common one here; see the Python website for more details about module installation.

Once you have downloaded the Python binding of Htmlys for your target platform, extract the contents of the archive into the directory where your Python installation expects to find third-party packages. On most Linux systems, the path to this directory typically has the following form:

/usr/lib(Q)?/python(V.V)?/site-packages

Q is the architecture qualifier (e.g. 64 on an amd64 system). It is only relevant for multilib operating system installations; for non-multilib installations, you can omit the qualifier. Most multilib installations also provide a symbolic link /usr/lib pointing to the directory of the native qualifier.

V.V is the version of Python installed. It is mainly relevant when multiple versions of Python are installed; some distributions are known to omit the version number when only one version of Python is installed. Please note that Htmlys for Python requires Python 3.2 or higher.
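If you are unsure which directory applies to your installation, Python itself can report it. The following is a minimal sketch using only the standard library; the helper names python_is_supported() and site_packages_dir() are made up for this illustration, and the choice of the "platlib" entry assumes that the Htmlys binding is a platform-specific (compiled) module. Some distributions use a different directory name, so treat the result as a hint rather than a guarantee.

```python
import sys
import sysconfig

# Minimum Python version required by Htmlys for Python, as stated above.
MIN_VERSION = (3, 2)


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version tuple is 3.2 or higher."""
    return tuple(version_info[:2]) >= MIN_VERSION


def site_packages_dir():
    """Return the directory where platform-specific packages are installed."""
    # "platlib" is the sysconfig key for platform-specific third-party modules.
    return sysconfig.get_paths()["platlib"]


print("Python version supported:", python_is_supported())
print("Third-party package directory:", site_packages_dir())
```

Run this with the same interpreter you intend to use with Htmlys, since each Python installation reports its own directory.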

Load and use the module

First, the module has to be loaded into Python. To do so, write a Python script (or start an interactive Python interpreter from the command line) and issue the following statement:

import htmlys

If the statement does not raise any error, the module is loaded successfully and ready to be used until the Python interpreter process terminates.
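In a script, you may prefer to detect a failed import and report it instead of letting the error stop the program. The following is a minimal sketch of that idea; try_import() is a hypothetical helper written for this illustration, not part of Htmlys.

```python
import importlib


def try_import(name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


htmlys = try_import("htmlys")
if htmlys is None:
    print("htmlys could not be loaded; check that the archive was "
          "extracted into the site-packages directory")
else:
    print("htmlys loaded successfully")
```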

Once loaded, the htmlys package provides the class HtmlHandler, which is intended to be subclassed with its methods overridden, and two functions, html_parse_string() and html_parse_file().

Next, subclass HtmlHandler into your own class and override the methods for the HTML tokens you are interested in. Then instantiate an object of your class and call html_parse_string() and/or html_parse_file() with this object as the first parameter.

The second parameter of html_parse_string() has to be plain HTML code, while the second parameter of html_parse_file() has to be a path to an HTML file on your filesystem. Both functions will parse the HTML content and call the methods of your HtmlHandler object as needed.

Example

Here is an example script using the module. Feel free to adapt it to your needs:

#
# Htmlys demonstration script for Python.
# -----------------------------------------------------------------------------
# Copyright (c) 2009 - 2013 Krizalys (http://www.krizalys.com/)
#
# Script to demonstrate the use of the Htmlys binding for Python. All the
# methods have an empty body and can be implemented freely by the developer.
# Methods that are not needed can be removed.
#
# Calls to the functions html_parse_string() and html_parse_file() can be
# adjusted or removed as needed.
#

import htmlys

class MyHtmlHandler(htmlys.HtmlHandler):
    def OnParseError(self):
        # Handle parse error
        pass

    def OnDoctype(self, name, publicId, systemId):
        # Handle DOCTYPE token
        pass

    def OnStartTag(self, name, attributes, selfClosing):
        # Handle start tag token
        pass

    def OnEndTag(self, name):
        # Handle end tag token
        pass

    def OnComment(self, data):
        # Handle comment token
        pass

    def OnChar(self, c):
        # Handle character token
        pass

    def OnEof(self):
        # Handle EOF token
        pass

handler = MyHtmlHandler()

htmlys.html_parse_string(handler,
'''<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Test</title>
</head>
<body class="test">
This is a test HTML document
<!-- a comment here -->
</body>
</html>
''')

# /path/to/document.html has to exist
htmlys.html_parse_file(handler, '/path/to/document.html')
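
As a further illustration, a handler does not need to override every method. The sketch below overrides only OnStartTag() and OnParseError() to count start tags and parse errors. The class name TagCounter and its attributes are hypothetical, and the fallback base class exists only so that the class can be exercised by hand when Htmlys is not installed. It also assumes that the tag name passed to OnStartTag() is hashable (for example, a string).

```python
from collections import Counter

try:
    import htmlys
    HandlerBase = htmlys.HtmlHandler
except ImportError:
    # Fallback (assumption, for illustration only): lets the class be
    # exercised by hand when Htmlys is not installed.
    htmlys = None
    HandlerBase = object


class TagCounter(HandlerBase):
    """Hypothetical handler that counts start tags and parse errors."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.parse_errors = 0

    def OnStartTag(self, name, attributes, selfClosing):
        # Only the tag name is used; attributes and selfClosing are ignored.
        self.tags[name] += 1

    def OnParseError(self):
        self.parse_errors += 1


counter = TagCounter()
if htmlys is not None:
    htmlys.html_parse_string(counter, '<p>one</p><p>two</p>')
    print(counter.tags, counter.parse_errors)
```

Keeping state on the handler object, as done here, lets you inspect the collected results after the parse function returns.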