
Lagarto is a fast, versatile, all-purpose event-based HTML parser. It can be used to modify or analyze markup content, making it easy to assemble complex custom transformations and code analysis tools.
Using Lagarto is simple. The target HTML code is parsed and each HTML element is visited by user callback methods. For example, the following HTML snippet:
<html><body>Hello</body></html>
would produce the following set of events:
start();
tag();    // for opening 'html' tag
tag();    // for opening 'body' tag
text();   // for text 'Hello'
tag();    // for closing 'body' tag
...
This event-based approach makes Lagarto very fast; in fact it is one of the fastest parsers out there.
Lagarto is to HTML what ASM is to bytecode :)
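The event stream described above can be illustrated with a minimal sketch. It does not use Lagarto itself: MiniVisitor and MiniEventParser are hypothetical stand-ins written only for this illustration, and the scanner assumes well-formed markup without attributes. It shows the general idea of an event-based parser, which reports each token to a callback instead of building a tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; a stand-in, NOT Lagarto's TagVisitor.
interface MiniVisitor {
    void start();
    void tag(String name, boolean closing);
    void text(String text);
    void end();
}

class MiniEventParser {

    // Scans the markup once and fires one event per token.
    static void parse(String html, MiniVisitor visitor) {
        visitor.start();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    // Unterminated tag: report the rest as text and stop.
                    visitor.text(html.substring(i));
                    break;
                }
                String body = html.substring(i + 1, close);
                boolean closing = body.startsWith("/");
                visitor.tag(closing ? body.substring(1) : body, closing);
                i = close + 1;
            } else {
                int next = html.indexOf('<', i);
                if (next < 0) {
                    next = html.length();
                }
                visitor.text(html.substring(i, next));
                i = next;
            }
        }
        visitor.end();
    }

    // Records every event as a string, in the order it was fired.
    static List<String> events(String html) {
        List<String> log = new ArrayList<>();
        parse(html, new MiniVisitor() {
            public void start() { log.add("start()"); }
            public void tag(String name, boolean closing) {
                log.add("tag(" + (closing ? "/" : "") + name + ")");
            }
            public void text(String text) { log.add("text(" + text + ")"); }
            public void end() { log.add("end()"); }
        });
        return log;
    }

    public static void main(String[] args) {
        events("<html><body>Hello</body></html>").forEach(System.out::println);
    }
}
```

Because nothing is stored between events, memory use in this sketch stays constant regardless of input size; that is the trade-off an event-based design makes against a tree-building one.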
Lagarto is just an HTML content parser, but many other cool things are built on top of it.
Parsing HTML content with Lagarto takes just two steps: create a LagartoParser instance, providing the content; then invoke parsing with a custom implementation of TagVisitor:
LagartoParser lagartoParser = new LagartoParser(htmlContent);
TagVisitor tagVisitor = new FooTagVisitor();
lagartoParser.parse(tagVisitor);
While the content is being parsed, Lagarto calls various callback methods of the TagVisitor. The most important callbacks are listed here; check the javadoc for more information.
start(), end(): Invoked before and after the content is parsed.
text(): Invoked on plain text.
comment(): Invoked on an HTML comment. The argument contains just the comment content, without the comment boundaries.
tag(): Invoked on an HTML tag: open, close or empty tag. The argument is a Tag instance that contains further information about the tag: tag name, attributes, depth level and so on.
script(): Invoked on all script tags. The arguments are the script Tag instance and the script body.
The Tag instance is reused while visiting HTML code, for better performance! The same instance is passed to all callback methods, so copy any data you need out of it instead of keeping a reference to it.
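The practical consequence of instance reuse can be shown with a small sketch. ReusedTag and TagReuseDemo are hypothetical stand-ins, not Lagarto classes; the sketch only assumes what is stated above, namely that one mutable object is passed to every callback. Storing the object itself keeps only the last state, while copying the needed data during the callback keeps every value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical mutable tag, standing in for a reused Tag instance.
class ReusedTag {
    private String name;
    void setName(String name) { this.name = name; }
    String getName() { return name; }
}

class TagReuseDemo {

    // Simulates a parser that reuses ONE tag object for every tag event.
    static void visitAll(String[] names, Consumer<ReusedTag> callback) {
        ReusedTag shared = new ReusedTag();
        for (String name : names) {
            shared.setName(name);
            callback.accept(shared);
        }
    }

    // Pitfall: keeps references to the shared instance.
    static List<String> collectByReference(String[] names) {
        List<ReusedTag> kept = new ArrayList<>();
        visitAll(names, kept::add);
        List<String> result = new ArrayList<>();
        for (ReusedTag tag : kept) {
            result.add(tag.getName());
        }
        return result;
    }

    // Safe: copies the name out of the tag inside the callback.
    static List<String> collectByCopy(String[] names) {
        List<String> result = new ArrayList<>();
        visitAll(names, tag -> result.add(tag.getName()));
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"html", "body", "p"};
        System.out.println(collectByReference(names)); // every entry shows the last tag
        System.out.println(collectByCopy(names));      // each entry shows its own tag
    }
}
```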
See the javadoc for other callback methods. Also, you can use EmptyTagVisitor to write your own visitors more conveniently.
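The convenience of a base visitor with empty callbacks can be sketched as follows. SketchVisitor, NoOpVisitor and TextCollector are hypothetical names used only for illustration, and the events are fired by hand rather than by a parser. The point is the pattern: when every callback has an empty default, a custom visitor overrides only the callbacks it cares about.

```java
// Hypothetical callback interface; a stand-in for illustration only.
interface SketchVisitor {
    void start();
    void tag(String name);
    void text(CharSequence text);
    void comment(CharSequence comment);
    void end();
}

// No-op base class: every callback does nothing by default.
class NoOpVisitor implements SketchVisitor {
    public void start() {}
    public void tag(String name) {}
    public void text(CharSequence text) {}
    public void comment(CharSequence comment) {}
    public void end() {}
}

// Overrides only the single callback it needs.
class TextCollector extends NoOpVisitor {
    private final StringBuilder collected = new StringBuilder();

    @Override
    public void text(CharSequence text) {
        collected.append(text);
    }

    String result() {
        return collected.toString();
    }
}

class NoOpVisitorDemo {

    // Fires a fixed, hand-written event sequence at the visitor.
    static String run() {
        TextCollector visitor = new TextCollector();
        visitor.start();
        visitor.tag("p");
        visitor.text("Hello ");
        visitor.comment("ignored");
        visitor.text("world");
        visitor.tag("/p");
        visitor.end();
        return visitor.result();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```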
Lagarto is very error-friendly. It does its best to recover from an error and continue parsing. Every error is reported by invoking the error() method of the visitor. By default, errors are written to the log as warnings.
Lagarto may be used not only for parsing and analyzing HTML content, but also for modifying it. For that purpose you can use TagWriter and TagAdapter.
TagWriter is a simple TagVisitor that writes content to some Appendable. So, if you pass a TagWriter instance to the parse() method, the resulting code will be the same as the input HTML code.
TagAdapter is a generic adapter for an underlying TagVisitor. It may be used over, for example, a TagWriter to perform some HTML transformation.
This approach allows HTML modifications to be performed in a single parsing pass! For example, if you have 3 different adapters that each modify the HTML code in some way, they can all be applied with just one parse of the content.
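The writer-plus-adapters idea can be sketched without Lagarto as follows. EventSink, MarkupWriter, SinkAdapter, BoldToStrong and UpperCaseText are hypothetical stand-ins (they are not TagWriter or TagAdapter), and the events are fired by hand instead of by a parser. Each adapter changes one aspect of an event and forwards it to the next sink, so a single pass over the events applies every transformation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical event sink; a stand-in for a visitor.
interface EventSink {
    void tag(String name, boolean closing);
    void text(String text);
}

// Writes events back out as markup to an Appendable (the writer role).
class MarkupWriter implements EventSink {
    private final Appendable out;

    MarkupWriter(Appendable out) {
        this.out = out;
    }

    public void tag(String name, boolean closing) {
        append("<" + (closing ? "/" : "") + name + ">");
    }

    public void text(String text) {
        append(text);
    }

    private void append(String s) {
        try {
            out.append(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Base adapter: forwards every event unchanged to the wrapped sink.
class SinkAdapter implements EventSink {
    protected final EventSink target;

    SinkAdapter(EventSink target) {
        this.target = target;
    }

    public void tag(String name, boolean closing) { target.tag(name, closing); }
    public void text(String text) { target.text(text); }
}

// Adapter 1: renames 'b' tags to 'strong'.
class BoldToStrong extends SinkAdapter {
    BoldToStrong(EventSink target) { super(target); }

    @Override
    public void tag(String name, boolean closing) {
        super.tag(name.equals("b") ? "strong" : name, closing);
    }
}

// Adapter 2: upper-cases all text.
class UpperCaseText extends SinkAdapter {
    UpperCaseText(EventSink target) { super(target); }

    @Override
    public void text(String text) {
        super.text(text.toUpperCase(Locale.ROOT));
    }
}

class AdapterChainDemo {

    // Sends the events for "<b>hi</b>" through the chain exactly once.
    static String transform() {
        StringBuilder out = new StringBuilder();
        EventSink chain = new UpperCaseText(new BoldToStrong(new MarkupWriter(out)));
        chain.tag("b", false);
        chain.text("hi");
        chain.tag("b", true);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transform());
    }
}
```

With no adapters in front of it, the writer in this sketch reproduces its input events unchanged, which mirrors the identity behavior described for TagWriter above.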
Lagarto is an event-based parser. While this gives great performance and low memory consumption, sometimes it is more convenient to build a DOM tree first and then work with it. Of course, creating a DOM tree requires more memory and more processing time.
Lagarto provides DOMBuilderTagVisitor for building DOM trees from HTML content. It can be easily used through LagartoDOMBuilder, like this:
LagartoDOMBuilder domBuilder = new LagartoDOMBuilder();
Document doc = domBuilder.parse(content);
LagartoDOMBuilder returns a Document - the root of the DOM tree. From there you can use the DOM tree as usual.
But that's not all :) While using the DOM API is fine, it is easier to parse and manipulate HTML content with an API that looks like jQuery, including CSS selectors.
For that, Jodd provides a tool called Jerry, built on top of the DOM tree.
Lagarto can be configured and fine-tuned in many ways to parse and interpret input content. See more details about Lagarto parsing modes.
One common use of Lagarto is as a servlet filter. There it may parse and modify the content before it is written to the output stream. For this purpose there is LagartoServletFilter, an abstract filter that can be extended.
LagartoServletFilter is a quite generic filter and can be used by any other parsing tool or solution. In most cases, however, SimpleLagartoServletFilter will be enough. Here the user just has to override one method to build a custom LagartoParsingProcessor. Inside the implementation, the user builds a set of nested TagAdapters and passes it to the invokeLagarto() method.
public class AppLagartoServletFilter extends SimpleLagartoServletFilter {

    @Override
    protected LagartoParsingProcessor createParsingProcessor() {
        return new LagartoParsingProcessor() {
            @Override
            protected char[] parse(TagWriter rootTagWriter, HttpServletRequest request) {
                // inner adapter wraps the writer that produces the output
                TagAdapter tagAdapter1 = new FooTagAdapter(rootTagWriter);
                // outer adapter wraps the inner one
                TagAdapter tagAdapter2 = new BarTagAdapter(tagAdapter1, request);
                // parse the content once through the nested adapters
                char[] content = invokeLagarto(tagAdapter2);
                return content;
            }
        };
    }
}