Lagarto parser is designed as a simple content parser. It does not care whether the content is malformed, nor whether it is HTML, XHTML or XML. As Lagarto does not generate content, there are not many configuration options.
However, Lagarto DOM parser is a different story. As its main goal is to build a DOM tree and then to generate the content, Lagarto DOM can be configured in many ways.
As Jerry uses Lagarto DOM, all this applies to that tool as well.
LagartoDOMBuilder is the Lagarto DOM builder. It has a few configuration settings that control how HTML, XHTML or XML content is read and generated.
The caseSensitive flag defines whether tag names and attribute names are case sensitive.
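As a rough illustration of what case sensitivity means for name lookups, the sketch below compares names under both settings. The class NameMatcher and its method are hypothetical and are not part of Lagarto; they only model the behavior described above.

```java
// Hypothetical helper, not part of Lagarto: models the effect of a
// case-sensitivity flag on tag and attribute name comparisons.
class NameMatcher {
    private final boolean caseSensitive;

    NameMatcher(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    /** Returns true if both names refer to the same tag or attribute. */
    boolean matches(String a, String b) {
        return caseSensitive ? a.equals(b) : a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        NameMatcher insensitive = new NameMatcher(false);
        NameMatcher sensitive = new NameMatcher(true);
        System.out.println(insensitive.matches("DIV", "div"));
        System.out.println(sensitive.matches("DIV", "div"));
    }
}
```

With case sensitivity off, DIV and div are the same name; with it on, they are two different names.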
Special tags (in Lagarto) are style, script and xmp. The parseSpecialTagsAsCdata flag determines whether the content of special tags is parsed, in other words, whether it is treated as PCDATA or as a CDATA section. When not parsed, the content is taken as is, so characters like < or & are allowed. Otherwise, the content has to be correctly encoded.
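To make the difference concrete, here is a minimal sketch of what "correctly encoded" implies for the author of the content. The class SpecialTagContent is a hypothetical helper, not a Lagarto API, and it only escapes the two characters mentioned above.

```java
// Hypothetical helper, not part of Lagarto: prepares text that will be
// placed inside a special tag such as script, style or xmp.
class SpecialTagContent {

    /**
     * When the content is treated as CDATA it is returned untouched.
     * Otherwise '&' and '<' are escaped so the content can be parsed.
     */
    static String prepare(String text, boolean treatedAsCdata) {
        if (treatedAsCdata) {
            return text;
        }
        // '&' must be replaced first, so already escaped output is not escaped twice
        return text.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        String js = "if (a < b && c) {}";
        System.out.println(prepare(js, true));
        System.out.println(prepare(js, false));
    }
}
```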
Regardless of content type, this flag simply defines whether the resulting DOM tree should contain comments.
The voidTags setting is the list of void element names. By default, it contains all HTML5 void elements. If set to null, there are no void elements.
When an element is a void element, this flag defines if it can be self-closed or if it should have the standard end tag.
The ignoreWhitespacesBetweenTags flag is used for XML mode, to ignore all whitespace-only content between two start tags or two end tags. Whitespace content between a start tag and an end tag is still kept.
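The whitespace rule can be sketched with a small token filter. This is only a model of the rule as worded above, not Lagarto's implementation: the class WhitespaceBetweenTags, its regular expression, and the simplification that every tag not starting with </ counts as a start tag are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Lagarto code: drops whitespace-only text that
// sits between two start tags or between two end tags.
class WhitespaceBetweenTags {
    // A token is either a tag or a run of text up to the next '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    static String strip(String markup) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(markup);
        while (m.find()) {
            tokens.add(m.group());
        }

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            boolean blankText = !isTag(token) && token.trim().isEmpty();
            if (blankText && i > 0 && i < tokens.size() - 1) {
                String prev = tokens.get(i - 1);
                String next = tokens.get(i + 1);
                // Skip only when both neighbours are tags of the same kind.
                if (isTag(prev) && isTag(next) && isEndTag(prev) == isEndTag(next)) {
                    continue;
                }
            }
            out.append(token);
        }
        return out.toString();
    }

    private static boolean isTag(String token) {
        return token.startsWith("<");
    }

    private static boolean isEndTag(String token) {
        return token.startsWith("</");
    }

    public static void main(String[] args) {
        System.out.println(strip("<a>\n  <b> </b>\n</a>"));
    }
}
```

In the sample input, the line breaks and indentation between <a> and <b>, and between </b> and </a>, are dropped, while the single space between <b> and </b> is kept.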
There are 3 predefined parsing modes: HTML, XHTML and XML. They can be easily set by calling enableXxxMode() on LagartoDOMBuilder. These methods will configure the builder to work with HTML, XHTML or XML code. Here are the details.
HTML mode:

    ignoreWhitespacesBetweenTags = false;   // collect all whitespaces
    caseSensitive = false;                  // HTML is case insensitive
    parseSpecialTagsAsCdata = true;         // script and style tags are parsed as CDATA
    voidTags = HTML5_VOID_TAGS;             // list of void tags
    selfCloseVoidTags = false;              // don't self close void tags

XHTML mode:

    ignoreWhitespacesBetweenTags = false;   // collect all whitespaces
    caseSensitive = true;                   // XHTML is case sensitive
    parseSpecialTagsAsCdata = false;        // all tags are parsed in the same way
    voidTags = HTML5_VOID_TAGS;             // list of void tags
    selfCloseVoidTags = true;               // self close void tags

XML mode:

    ignoreWhitespacesBetweenTags = true;    // ignore whitespaces that are non content
    caseSensitive = true;                   // XML is case sensitive
    parseSpecialTagsAsCdata = false;        // all tags are parsed in the same way
    voidTags = null;                        // there are no void tags
    selfCloseVoidTags = false;              // don't self close empty tags (can be changed!)
Users can further adjust these predefined modes by setting individual flags.
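The idea of "start from a preset, then adjust single flags" can be modeled in plain Java as follows. This is a sketch only: ParsingModeSketch is a hypothetical class, not the real LagartoDOMBuilder, and SAMPLE_VOID_TAGS is a short illustrative subset standing in for HTML5_VOID_TAGS. The field names and preset values are taken from the mode listings above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the three presets; not Lagarto's actual class.
class ParsingModeSketch {
    // Illustrative subset only, standing in for HTML5_VOID_TAGS.
    static final List<String> SAMPLE_VOID_TAGS = Arrays.asList("br", "hr", "img");

    boolean ignoreWhitespacesBetweenTags;
    boolean caseSensitive;
    boolean parseSpecialTagsAsCdata;
    List<String> voidTags;
    boolean selfCloseVoidTags;

    static ParsingModeSketch html() {
        ParsingModeSketch mode = new ParsingModeSketch();
        mode.ignoreWhitespacesBetweenTags = false;
        mode.caseSensitive = false;
        mode.parseSpecialTagsAsCdata = true;
        mode.voidTags = SAMPLE_VOID_TAGS;
        mode.selfCloseVoidTags = false;
        return mode;
    }

    static ParsingModeSketch xhtml() {
        ParsingModeSketch mode = new ParsingModeSketch();
        mode.ignoreWhitespacesBetweenTags = false;
        mode.caseSensitive = true;
        mode.parseSpecialTagsAsCdata = false;
        mode.voidTags = SAMPLE_VOID_TAGS;
        mode.selfCloseVoidTags = true;
        return mode;
    }

    static ParsingModeSketch xml() {
        ParsingModeSketch mode = new ParsingModeSketch();
        mode.ignoreWhitespacesBetweenTags = true;
        mode.caseSensitive = true;
        mode.parseSpecialTagsAsCdata = false;
        mode.voidTags = null;            // there are no void tags in XML
        mode.selfCloseVoidTags = false;
        return mode;
    }

    public static void main(String[] args) {
        // Start from the XML preset, then change one flag.
        ParsingModeSketch mode = ParsingModeSketch.xml();
        mode.selfCloseVoidTags = true;
        System.out.println(mode.caseSensitive + " " + mode.selfCloseVoidTags);
    }
}
```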
(X)HTML modes are aware of void elements. Lagarto DOM builder tries to follow the specification and, therefore:
Regular content elements are closed with an end tag, even when the element is empty.
XML mode is not aware of void elements. By default, empty elements are closed with an end tag, but that can be changed to self-closing tags.
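The two rules for empty elements stated above can be sketched as a tiny serializer. EmptyElementWriter is a hypothetical class and not Lagarto's renderer; it deliberately leaves out void elements and covers only regular (X)HTML elements and XML elements.

```java
// Hypothetical sketch, not Lagarto code: writes an element that has no
// children and no attributes.
class EmptyElementWriter {

    /** Regular (non-void) element in the (X)HTML modes: always an end tag. */
    static String regularHtmlElement(String name) {
        return "<" + name + "></" + name + ">";
    }

    /** Empty element in XML mode: end tag by default, self-closed on request. */
    static String xmlElement(String name, boolean selfClose) {
        return selfClose
            ? "<" + name + "/>"
            : "<" + name + "></" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(regularHtmlElement("div"));
        System.out.println(xmlElement("item", false));
        System.out.println(xmlElement("item", true));
    }
}
```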
Lagarto DOM tries to handle malformed content in a user-friendly way. It tries to fix problems in the best way possible. How an error is fixed also depends on the parsing mode. Every error is logged as a warning.
The most important fix is the handling of a missing end tag. To support a newer HTML5 feature (some HTML5 elements do not require an end tag), Lagarto DOM builder does the following: the unclosed element wraps only the next non-blank sibling (i.e. a node with content). So the following example (notice the missing tags):
    <div class="section" id="forest-elephants" >
    <h1>Forest elephants</h1>
    <p>In this section, we discuss the lesser known forest elephants...
    <div class="subsection" id="forest-habitat" >
    <h2>Habitat</h2>
    <p>Forest elephants do not live in trees but among the...
    </div>
    </div>
would be interpreted as (notice the added end tags):
    <div class="section" id="forest-elephants" >
    <h1>Forest elephants</h1>
    <p>In this section, we discuss the lesser known forest elephants...
    </p><div class="subsection" id="forest-habitat" >
    <h2>Habitat</h2>
    <p>Forest elephants do not live in trees but among the...
    </p></div>
    </div>
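The "wrap the next non-blank sibling" rule can be sketched on a flat list of nodes. This is a simplified model, not Lagarto's algorithm: the class UnclosedTagFix is hypothetical, nodes are represented as already serialized strings, and the choices to keep leading blank nodes inside the element and to keep everything when no non-blank node exists are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Lagarto code: decides where the missing end tag
// of an unclosed element is placed among the nodes that follow its start tag.
class UnclosedTagFix {

    /**
     * Returns how many leading nodes stay inside the unclosed element:
     * everything up to and including the first non-blank node.
     */
    static int wrappedCount(List<String> followingNodes) {
        for (int i = 0; i < followingNodes.size(); i++) {
            if (!followingNodes.get(i).trim().isEmpty()) {
                return i + 1;
            }
        }
        return followingNodes.size();
    }

    /** Serializes the element with its end tag inserted after the wrapped nodes. */
    static String close(String startTag, String name, List<String> followingNodes) {
        int wrapped = wrappedCount(followingNodes);
        StringBuilder out = new StringBuilder(startTag);
        for (int i = 0; i < wrapped; i++) {
            out.append(followingNodes.get(i));
        }
        out.append("</").append(name).append(">");
        // The remaining nodes become siblings of the fixed element.
        for (int i = wrapped; i < followingNodes.size(); i++) {
            out.append(followingNodes.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("Forest elephants... ", "<div></div>");
        System.out.println(close("<p>", "p", nodes));
    }
}
```

As in the example above, the text node stays inside the paragraph and the following div becomes its sibling.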
Lagarto is not strict about the content and can't be used for validation.
Here is how predefined parsing modes can be used.
    LagartoDOMBuilder lagartoDOMBuilder = new LagartoDOMBuilder();
    Document doc = lagartoDOMBuilder.enableHtmlMode().parse(content);