Htmlys manual for Python

Installation

There are several ways to install Htmlys for Python on your system. We cover the most common one here; see the Python website for more details about module installation.

Once you have downloaded the Python binding of Htmlys for your target platform, extract the contents of the archive into the directory where your Python installation expects to find third-party packages. On most Linux systems, the path to this directory typically matches the following pattern:

/usr/lib(Q)?/python(V.V)?/site-packages

Q is the architecture qualifier (e.g. 64 on an amd64 system). It is only relevant for multilib operating system installations; for non-multilib installations, you can omit the qualifier. Most multilib installations also provide a symbolic link /usr/lib pointing to the native qualifier.

V.V is the version of Python installed. It is mainly relevant when multiple versions of Python are installed; some distributions are known to omit the version number if only one version of Python is installed. Please note that Htmlys for Python requires Python 3.2 or higher.
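
Rather than guessing Q and V.V by hand, you can ask the interpreter itself. The following is a minimal sketch using the standard sysconfig module. It assumes the binding is platform-specific (hence the "platlib" path), and the helper names site_packages_dir() and python_is_supported() are hypothetical, introduced here for illustration only:

```python
import sys
import sysconfig

# Minimum Python version stated in this manual.
MIN_VERSION = (3, 2)


def site_packages_dir():
    """Return the directory this interpreter uses for platform-specific
    third-party packages (assumption: the binding belongs there)."""
    return sysconfig.get_paths()["platlib"]


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version is at least MIN_VERSION."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    print("Extract the archive into:", site_packages_dir())
    print("Python version supported:", python_is_supported())
```

Run the script with the same interpreter you intend to use Htmlys with, since each Python installation reports its own directory.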

Load and use the module

First, the module has to be loaded into Python. To do so, write a Python script (or start an interactive Python interpreter from the command line) and issue the following statement:

import htmlys

If the statement does not raise any error, the module has been loaded successfully and is ready to be used until the Python interpreter process terminates.
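
If you want a script to check for the module before importing it, the standard importlib machinery can help. This is only a sketch: module_available() is a hypothetical helper name, and the message printed on failure is illustrative:

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be found on
    the import path, without actually importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("htmlys"):
        import htmlys
        print("htmlys loaded")
    else:
        print("htmlys not found; check that the archive was extracted "
              "into the site-packages directory of this interpreter")
```

A plain import wrapped in try/except ImportError would serve the same purpose; the check above simply avoids loading the module as a side effect.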

As long as it is loaded, it provides the class HtmlHandler, which is intended to be subclassed and its methods overridden, and two functions, html_parse_string() and html_parse_file(). The class and both functions are defined within the htmlys package.

Next, subclass HtmlHandler into your own class and override the methods for the HTML tokens you are interested in. Then instantiate an object of your class and call html_parse_string() and/or html_parse_file() with this object as the first parameter.

The second parameter of html_parse_string() has to be plain HTML code, while the second parameter of html_parse_file() has to be a path to an HTML file on your filesystem. Both functions will parse the HTML content and call the methods of your HtmlHandler object as needed.

Example

Here is an example script using the module; feel free to adapt it to your needs:

#
# Htmlys demonstration script for Python.
# -----------------------------------------------------------------------------
# Copyright (c) 2009 - 2013 Krizalys (http://www.krizalys.com/)
#
# Script to demonstrate the use of the Htmlys binding for Python. All the
# methods have an empty body and could be implemented freely by the developer.
# Methods that are not needed can be removed.
#
# Calls to html_parse_string() and html_parse_file() can be adjusted
# or removed as needed.
#
 
import htmlys
 
class MyHtmlHandler(htmlys.HtmlHandler):
	def OnParseError(self):
		# Handle parse error
		pass
 
	def OnDoctype(self, name, publicId, systemId):
		# Handle DOCTYPE token
		pass
 
	def OnStartTag(self, name, attributes, selfClosing):
		# Handle start tag token
		pass
 
	def OnEndTag(self, name):
		# Handle end tag token
		pass
 
	def OnComment(self, data):
		# Handle comment token
		pass
 
	def OnChar(self, c):
		# Handle character token
		pass
 
	def OnEof(self):
		# Handle EOF token
		pass
 
handler = MyHtmlHandler()
 
htmlys.html_parse_string(handler,
'''<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Test</title>
</head>
<body class="test">
This is a test HTML document
<!-- a comment here -->
</body>
</html>
''')
 
# /path/to/document.html has to exist
htmlys.html_parse_file(handler, '/path/to/document.html')
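
The skeleton above leaves every method empty. As an illustration of filling some of them in, here is a minimal sketch of a handler that records start tag names and counts parse errors. The class name TagCollector and its attributes are hypothetical. The fallback base class is only there so the sketch can be run without the binding installed, and it is not part of Htmlys:

```python
try:
    import htmlys
    BaseHandler = htmlys.HtmlHandler
except ImportError:
    # Stand-in (not part of Htmlys) so the sketch runs without the binding.
    htmlys = None

    class BaseHandler(object):
        pass


class TagCollector(BaseHandler):
    """Hypothetical handler: remembers start tag names, counts errors."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.parse_errors = 0

    def OnParseError(self):
        # Called on a parse error; just count it.
        self.parse_errors += 1

    def OnStartTag(self, name, attributes, selfClosing):
        # Only the tag name is used; attributes and selfClosing are ignored.
        self.start_tags.append(name)


if __name__ == "__main__":
    collector = TagCollector()
    if htmlys is not None:
        htmlys.html_parse_string(collector, "<p>Hello <b>world</b></p>")
    print("Start tags seen:", collector.start_tags)
    print("Parse errors:", collector.parse_errors)
```

Methods that are not overridden here (OnDoctype, OnEndTag, and so on) are simply left to the base class, in line with the note in the script header that unneeded methods can be removed.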