The Unbearable Lightness of Java

Lagarto parsing modes

The Lagarto parser is designed as a simple content parser. It does not care whether the content is malformed, nor whether it is HTML, XHTML or XML. As Lagarto does not generate content, it has few configuration options.

However, the Lagarto DOM parser is a different story. As its main goal is to build a DOM tree and then to generate the content, Lagarto DOM can be configured in many ways.

As Jerry uses Lagarto DOM, all this applies to that tool as well.

Lagarto DOM Builder configuration

LagartoDOMBuilder is the Lagarto DOM builder. It has a few configuration settings that control how HTML, XHTML or XML content is read and generated.

caseSensitive

Defines if tag names and attribute names are case sensitive.
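
As a rough illustration of what this flag means for name lookups, the sketch below compares names either exactly or ignoring case. NameMatcher is a hypothetical helper written for this page; it is not part of Lagarto.

```java
// Hypothetical helper, not Lagarto's API: models how a caseSensitive flag
// could affect the comparison of tag and attribute names.
final class NameMatcher {
    private final boolean caseSensitive;

    NameMatcher(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Returns true when the two names denote the same tag or attribute.
    boolean matches(String a, String b) {
        return caseSensitive ? a.equals(b) : a.equalsIgnoreCase(b);
    }
}
```

With the flag off, "DIV" and "div" are the same name; with the flag on, they are two different names.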

parseSpecialTagsAsCdata

Special tags (in Lagarto) are: style, script and xmp. This flag determines whether the content of special tags is parsed; in other words, whether it is treated as PCDATA or as a CDATA section. When the flag is set, the content is not parsed and is simply taken as it is, so characters like < or & are allowed. Otherwise, the content has to be correctly encoded.
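
The difference between the two treatments can be pictured with a small model. The sketch below is an assumption-laden simplification: SpecialTagText is a hypothetical class, not Lagarto's API, and it decodes only three entities.

```java
// Hypothetical model, not Lagarto's API: shows how the text of a special tag
// (style, script, xmp) would be read under each setting of the flag.
final class SpecialTagText {

    static String read(String raw, boolean parseSpecialTagsAsCdata) {
        if (parseSpecialTagsAsCdata) {
            // CDATA: the content is taken as it is
            return raw;
        }
        // PCDATA: the content must be encoded, so decode it.
        // "&amp;" is replaced last to avoid decoding a character twice.
        return raw.replace("&lt;", "<")
                  .replace("&gt;", ">")
                  .replace("&amp;", "&");
    }
}
```

In this model, a script body written as `a < b && c` is usable only when the flag is set, while `a &lt; b &amp;&amp; c` yields the same text when the flag is not set.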

ignoreComments

Regardless of content type, this flag defines whether the resulting DOM tree contains comments.

voidTags

List of void element names. By default, it contains all HTML5 void elements. If set to null, then there are no void elements.

selfCloseVoidTags

When an element is a void element, this flag defines if it can be self-closed or if it should have the standard end tag.

ignoreWhitespacesBetweenTags

This flag is used in XML mode to ignore all whitespace content between two start tags or two end tags. Whitespace between a start tag and an end tag is still preserved.
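
Read literally, the rule above can be sketched as a filter over a flat token list. This is only a model of the description: WhitespaceFilter, Kind and Token are hypothetical names, not Lagarto's API, and the sketch assumes that whitespace in any other position (for example between an end tag and a start tag) is kept.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Lagarto's API: drops whitespace-only text that sits
// between two start tags or between two end tags.
final class WhitespaceFilter {

    enum Kind { START, END, TEXT }

    static final class Token {
        final Kind kind;
        final String value;

        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    static List<Token> filter(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            boolean blankText = t.kind == Kind.TEXT && t.value.trim().isEmpty();
            if (blankText && i > 0 && i < tokens.size() - 1) {
                Kind prev = tokens.get(i - 1).kind;
                Kind next = tokens.get(i + 1).kind;
                boolean bothStart = prev == Kind.START && next == Kind.START;
                boolean bothEnd = prev == Kind.END && next == Kind.END;
                if (bothStart || bothEnd) {
                    continue; // non-content whitespace: ignored
                }
            }
            out.add(t); // everything else is kept
        }
        return out;
    }
}
```

For the token sequence of `<a>`, newline, `<b>`, space, `</b>`, newline, `</a>`, the two newlines are dropped and the single space inside b is kept.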

Predefined parsing modes

There are 3 predefined parsing modes: HTML, XHTML and XML. They can be easily set by calling enableXxxMode() on LagartoDOMBuilder. These methods will configure the builder to work with HTML, XHTML or XML code. Here are the details.

HTML mode (default)

ignoreWhitespacesBetweenTags = false;   // collect all whitespaces
caseSensitive = false;                  // HTML is case insensitive
parseSpecialTagsAsCdata = true;         // script and style tags are parsed as CDATA
voidTags = HTML5_VOID_TAGS;             // list of void tags
selfCloseVoidTags = false;              // don't self close void tags

XHTML mode

ignoreWhitespacesBetweenTags = false;   // collect all whitespaces
caseSensitive = true;                   // XHTML is case sensitive
parseSpecialTagsAsCdata = false;        // all tags are parsed in the same way
voidTags = HTML5_VOID_TAGS;             // list of void tags
selfCloseVoidTags = true;               // self close void tags

XML mode

ignoreWhitespacesBetweenTags = true;    // ignore whitespace that is not content
caseSensitive = true;                   // XML is case sensitive
parseSpecialTagsAsCdata = false;        // all tags are parsed in the same way
voidTags = null;                        // there are no void tags
selfCloseVoidTags = false;              // don't self close empty tags (can be changed!)

Users can further adjust these predefined modes by setting individual flags.
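
The three presets, and a per-flag override on top of one of them, can be modeled as a plain settings object. The sketch below is not Lagarto's API: ParsingModeSketch and its factory methods are hypothetical, only the flag names and values are taken from the listings above, and the void tag list is abbreviated for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical settings object, not Lagarto's API: mirrors the flag values of
// the three predefined modes.
final class ParsingModeSketch {

    // Abbreviated list, for illustration only
    static final List<String> HTML5_VOID_TAGS = Collections.unmodifiableList(
            Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    boolean ignoreWhitespacesBetweenTags;
    boolean caseSensitive;
    boolean parseSpecialTagsAsCdata;
    List<String> voidTags;
    boolean selfCloseVoidTags;

    static ParsingModeSketch html() {
        ParsingModeSketch m = new ParsingModeSketch();
        m.ignoreWhitespacesBetweenTags = false;
        m.caseSensitive = false;
        m.parseSpecialTagsAsCdata = true;
        m.voidTags = HTML5_VOID_TAGS;
        m.selfCloseVoidTags = false;
        return m;
    }

    static ParsingModeSketch xhtml() {
        ParsingModeSketch m = html();
        m.caseSensitive = true;
        m.parseSpecialTagsAsCdata = false;
        m.selfCloseVoidTags = true;
        return m;
    }

    static ParsingModeSketch xml() {
        ParsingModeSketch m = new ParsingModeSketch();
        m.ignoreWhitespacesBetweenTags = true;
        m.caseSensitive = true;
        m.parseSpecialTagsAsCdata = false;
        m.voidTags = null;              // there are no void tags
        m.selfCloseVoidTags = false;
        return m;
    }
}
```

Starting from `ParsingModeSketch.xml()` and then setting `selfCloseVoidTags = true` corresponds to the "can be changed!" remark in the XML listing: a preset is only a starting point.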

Void and self-closing elements

(X)HTML modes are aware of void elements. Lagarto DOM builder tries to follow the specification and, therefore:

  • void elements in HTML mode are not self-closed
  • void elements in XHTML mode are self-closed

Regular content elements are closed with an end tag, even when they are empty.

XML mode is not aware of void elements. By default, empty elements are closed with an end tag, but that can be changed to self-closing tags.
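
The rules above can be condensed into one function that writes an element without content. This is a sketch, not Lagarto's renderer: EmptyElementRenderer is a hypothetical name, and it assumes that a void element that is not self-closed is written as a bare start tag, as the HTML5 specification describes.

```java
import java.util.List;

// Hypothetical model, not Lagarto's renderer: writes an element that has no
// content, following the void and self-closing rules described above.
final class EmptyElementRenderer {

    static String renderEmpty(String name, List<String> voidTags, boolean selfCloseVoidTags) {
        if (voidTags == null) {
            // XML mode: no void elements, the flag applies to every empty element
            return selfCloseVoidTags ? "<" + name + "/>" : "<" + name + "></" + name + ">";
        }
        if (voidTags.contains(name)) {
            // (X)HTML void element; the bare start tag form is an assumption
            return selfCloseVoidTags ? "<" + name + "/>" : "<" + name + ">";
        }
        // regular content element: always closed with an end tag
        return "<" + name + "></" + name + ">";
    }
}
```

With a void tag list that contains br, the function returns `<br>` for HTML settings and `<br/>` for XHTML settings, while an empty div is `<div></div>` in both. With a null list, an empty item is `<item></item>` by default and `<item/>` once the flag is set.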

Fixing malformed content

Lagarto DOM tries to handle malformed content in a user-friendly way. It tries to fix problems in the best way possible. How errors are fixed also depends on the parsing mode. Every error is logged as a warning.

The most important fix is the handling of missing end tags. HTML5 allows some elements to omit their end tag. To support this, Lagarto DOM builder makes the unclosed element wrap only the next non-blank sibling (i.e. the next node with content). So the following example (notice the missing tags):

<div class="section" id="forest-elephants" >
	<h1>Forest elephants</h1>
	<p>In this section, we discuss the lesser known forest elephants...
	<div class="subsection" id="forest-habitat" >
		<h2>Habitat</h2>
		<p>Forest elephants do not live in trees but among the...
	</div>
</div>

would be interpreted as (notice the added end tags):

<div class="section" id="forest-elephants" >
	<h1>Forest elephants</h1>
	<p>In this section, we discuss the lesser known forest elephants...
	</p><div class="subsection" id="forest-habitat" >
		<h2>Habitat</h2>
		<p>Forest elephants do not live in trees but among the...
	</p></div>
</div>
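
The repair can be modeled on a small tree. The sketch below is a simplified reading of the rule, not Lagarto's implementation: MissingEndTagSketch and its methods are hypothetical, attributes are left out, and the tags are fed in by hand instead of being parsed. When an end tag does not match the currently open element, every unclosed element in between keeps its children up to the first non-blank one, and the remaining children are moved out to follow it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Lagarto's implementation.
final class MissingEndTagSketch {

    static final class Node {
        final String name;   // null for text nodes
        final String text;   // null for elements
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) { this.name = name; this.text = text; }

        boolean isBlank() { return name == null && text.trim().isEmpty(); }

        void add(Node child) { child.parent = this; children.add(child); }

        String render() {
            if (name == null) return text;
            StringBuilder sb = new StringBuilder();
            if (parent != null) sb.append('<').append(name).append('>');
            for (Node c : children) sb.append(c.render());
            if (parent != null) sb.append("</").append(name).append('>');
            return sb.toString();
        }
    }

    private final Node root = new Node("#root", null);
    private Node current = root;

    MissingEndTagSketch open(String name) {
        Node n = new Node(name, null);
        current.add(n);
        current = n;
        return this;
    }

    MissingEndTagSketch text(String value) {
        current.add(new Node(null, value));
        return this;
    }

    MissingEndTagSketch close(String name) {
        // find the nearest open element with this name
        Node match = current;
        while (match != root && !match.name.equals(name)) match = match.parent;
        if (match == root) return this;     // stray end tag: ignored in this sketch
        // repair every unclosed element in between, innermost first
        while (current != match) {
            Node unclosed = current;
            current = unclosed.parent;
            repair(unclosed);
        }
        current = match.parent;
        return this;
    }

    // Keeps children up to the first non-blank one; moves the rest after the element.
    private static void repair(Node unclosed) {
        List<Node> kids = unclosed.children;
        int keep = 0;
        while (keep < kids.size() && kids.get(keep).isBlank()) keep++;
        keep = Math.min(keep + 1, kids.size());
        List<Node> moved = new ArrayList<>(kids.subList(keep, kids.size()));
        kids.subList(keep, kids.size()).clear();
        Node parent = unclosed.parent;
        int at = parent.children.indexOf(unclosed) + 1;
        for (Node m : moved) {
            m.parent = parent;
            parent.children.add(at++, m);
        }
    }

    String render() { return root.render(); }
}
```

Feeding a reduced version of the example, `open("div").open("p").text("one ").open("div").open("p").text("two ").close("div").close("div")`, renders as `<div><p>one </p><div><p>two </p></div></div>`: each p wraps only its text, and the inner div ends up as a sibling of the first p.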

Strict rules

Lagarto is not strict about the content and can't be used for validation.

Little example

Here is how predefined parsing modes can be used.

	import jodd.lagarto.dom.Document;
	import jodd.lagarto.dom.LagartoDOMBuilder;
	LagartoDOMBuilder lagartoDOMBuilder = new LagartoDOMBuilder();
	Document doc = lagartoDOMBuilder.enableHtmlMode().parse(content);