DOM Builder

Lagarto is an event-base parser. While this gives great performances and low memory consumption, sometimes it is more convenient to build a DOM tree first and then to manage it later. Of course, creating DOM requires more memory and more processing time.

Lagarto introduces DOMBuilder interface, for various implementations of DOM builders. Default implementation, LagartoDOMBuilder uses tag visitor to build a DOM from HTML content. It may be used like this:

    LagartoDOMBuilder domBuilder = new LagartoDOMBuilder();
    Document doc = domBuilder.parse(content);

DOMBuilder instance always returns a Document - the root DOM tree node. From there you can work on the DOM tree using common Node methods.

HTML content and DOM Builder

As we said on the previous page, processing HTML (like eg browsers do) requires a second step: building a DOM tree from the input tokens.

While Lagarto parser strictly follows the HTML5 rules, LagartoDOMBuilder follows only a subset of the DOM-building specification! Here is why.

By default, LagartoDOMBuilder follows all the rules that does not involve any movements of DOM nodes. This is done on purpose, so to get the exact tree to what you have provide. If you passed HTML with some tags that are not suppose to be nested, LagartoDOMBuilder would not complain and you will get exactly what you had on input.

In most cases, this is perfectly fine, as you are probably not using all the tricks of HTML5 for the sake of better readability. But if you need some more rules, you can turn them on! By enabling them, resulting DOM tree can get modified per HTML5 rules. We have implemented the most common of these rules and exceptions, but haven't covered them all (yet). So if you have some quite weird HTML, you might expect different tree then what you have in browser. Don't run away yet, it never happened in the real life that Lagarto was not enough for the job!

LagartoDOMBuilder is not (yet) a strict implementation of HTML5 DOM-building rules; but it is good-enough for most cases!

Just don't forget this and carry on :)

Custom DOM Builder

There are few ways how user can use it's own version of TagVisitor for building the DOM tree. One such way is by extending the method: createDOMDomBuilderTagVisitor() where you can provide your own implementation of DOMBuilderTagVisitor. For example:

    LagartoDOMBuilder domBuilder = new LagartoDOMBuilder() {
        @Override
        protected DOMBuilderTagVisitor createDOMDomBuilderTagVisitor() {
            return new MyDOMBuilderTagVisitor();
        }
    }
    Document doc = domBuilder.parse(content);

This way you can modify the original behavior of DOM builder. For example, you may build DOM tree by treating the HTML content as Internet Explorer (by processing conditional comments) or tree without certain tags or comments etc.

Conditional comments

Here is an example of custom DOM builder that treats HTML as non-IE browser:

    LagartoDOMBuilder domBuilder = new LagartoDOMBuilder() {
        @Override
        protected DOMBuilderTagVisitor createDOMDomBuilderTagVisitor() {
            return new DOMBuilderTagVisitor(this) {
                @Override
                public void condComment(
                        CharSequence expression,
                        boolean isStartingTag,
                        boolean isHidden,
                        CharSequence comment) {

                    String cc = expression.toString().trim();

                    if (cc.equals("if !IE") == false) {
                        enabled = cc.equals("endif");
                    }
                }
            };
        }
    };
    Document doc = domBuilder.parse(content);

This DOM builder simply ignores all conditional comments except one that specifies non-IE browsers (if !IE).

Note the usage of field enabled in the line #15. It's internal field that can be used to enable and disable DOM tree creation. Here we use it to disable DOM tree creation for all content between conditional tags except for those specified for non-IE browser.

After parsing

During parsing, LagartoDomBuilder also collects some information that are available after:

  • getErrors() - if errors are collected during parsing, this methods will return list of all errors as they appear.
  • getParsingTime() - return parsing time in milliseconds.

Jerry

But that's not all:) While using Lagarto DOM API is sufficient, it is easier to parse and manipulate HTML content using API that looks more like JQuery - including using CSS selectors.

For that Jodd provides a tool called Jerry, built on Lagarto DOM tree and CSS selector engine.

Lagarto configuration

Lagarto can be configured and fine tuned in many ways to parse and interpret input content. See more details about Lagarto parsing modes.