Creating a search DSL

-

As an (elastic)search expert, I regularly visit customers. For these customers I often do a short analysis of their search solution and I give advice about improvements they can make. It is always interesting to look at solutions customers come up with. At one of my most recent customers I noticed a search solution based on a very extensive search DSL (Domain Specific Language) created with Antlr. I knew about Antlr, but never thought about creating my own search DSL.

To better understand the options of Antlr and to practice with creating my own DSL I started experimenting with it. In this blog post I’ll take you on my learning journey. I am going to create my own very basic search DSL.

Specifying the DSL

First we need to define the queries we would like our users to enter. Below are some examples:

  • tree – This is an easy one, just one word to search for
  • tree apple – Two words to look up
  • tree apple AND sell – Find matching content for tree apple, but also containing sell.
  • tree AND apple OR juice – Find matching content containing the terms tree and apple or containing the term juice.
  • “apple tree” OR juice – Find content having the terms apple and tree next to each other in the right order (Phrase query) or having the term juice.

These are the combinations we need to make. In the next sections we setup our environment and I explain the basics of Antlr that you need to understand to follow along.

Setting up Antlr for your project

There are lots of resources about setting up your local Antlr environment. I personally learned most from tomassetti. I prefer to use Maven to gather the required dependencies. I also use the Maven Antlr plugin to generate the Java classes based on the Lexar and Grammar rules.

I also installed Antlr using Homebrew, but you do not really need this for this blog post.

You can find the project on Github: https://github.com/jettro/search-dsl

I generally just load the Maven project into IntelliJ and get everything running from there. If you don’t want to use an IDE, you can also do this with pure Maven.

proj_home #> mvn clean install
proj_home #> mvn dependency:copy-dependencies
proj_home #> java -classpath "target/search-dsl-1.0-SNAPSHOT.jar:target/dependency/*"  nl.gridshore.searchdsl.RunStep1

Of course you can change the RunStep1 into one of the other three classes.

Antlr introduction

This blog post does not have the intention to explain all ins and outs of Antlr. But there are a few things you need to know if you want to follow along with the code samples.

  • Lexer – A program that takes a phrase and obtains tokens from the phrase. Examples of lexers are: AND consisting of the characters ‘AND’ but also the specials characters ‘&&’. Another example is a WORD consisting of upper or lowercase characters and numbers. Tokens coming out of a Lexer contain the type of the token as well as the matched characters by that token.
  • Grammar – Rules that make use of the Lexer to create the syntax of your DSL. The result is a parser that creates a ParseTree out of your phrase. For example, we have a grammar rule querythat parses a phrase like tree AND apple into the following ParseTree. The Grammar rule is: query : term (AND term)+ ;.
  • ParseTree – Tree by Antlr using the grammar and lexer from the provided phrase. Antlr also comes with a tool to create a visual tree. See an example below. In this blog post we create our own parser of the tree, there are however two better alternatives. The first is using the classic Listener pattern. The other is the Visitor pattern.
    Antlr4 parse tree 1
  • Listener – Antlr generates some parent classes to create your own listener. The idea behind a a listener is that you receive events when a new element is started and when the element is finished. This resembles how for instance the SAX parser works.
  • Visitor – Antlr generates some parent classes to create your own Visitors. With a visitor you start visiting your top level element, then you visit the children, that way you recursively go down the tree. In a next blog post we’ll discuss the visitor pattern in depth.

Search DSL Basics

In this section we are going to create the DSL in four small steps. For each step we have a StepXLexerRules.g4 and a StepXSearchDsl.g4 file containing the Antlr lexer and grammar rules. Each step also contains a Java file with the name RunStepX.

Step 1

In this step we want to write rules like:

  • apple
  • apple juice
  • apple1 juice

lexer
WORD        : ([A-z]|[0-9])+ ;
WS          : [ \t\r\n]+ -> skip ;
 
grammar
query       : WORD+ ;

In all the Java examples we’ll start the same. I’ll mention the rules here but will not go into depth in the other steps.

Lexer lexer = new Step1LexerRules(CharStreams.fromString("apple juice"));
CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
 
Step1SearchDslParser parser = new Step1SearchDslParser(commonTokenStream);
Step1SearchDslParser.QueryContext queryContext = parser.query();

First we create the Lexer, the Lexer is generated by Antlr. The input is a stream of characters created using the class CharStreams. From the Lexer we obtain a stream of Tokens, which is the input for the parser. The parser is also generated by Antlr. Using the parser we can obtain the queryContext. Notice the method query. This is the same name as the first grammar rule.

In this basic example a query consists of at least one WORD and a WORD consists of upper and lower case characters and numbers. The output for the first step is:

Source: apple
WORDS (1): apple,
Source: apple juice
WORDS (2): apple,juice,
Source: apple1 juice
WORDS (2): apple1,juice,

In the next step we are extending the DSL with an option to keep words together.

Step 2

In the previous step you got the option to search for one or multiple words. In this step we are adding the option to keep some words together by surrounding them with quotes. We add the following lines to the lexer and grammar.

1
2
3
4
5
6
7
8
lexer
QUOTE   : ["];
 
grammar
query               : term ;
 
term                : WORD+|quotedTerm;
quotedTerm          : QUOTE WORD+ QUOTE ;

Now we can support queries like

apple
“apple juice”
The addition to the lexer is QUOTE, the grammar becomes slightly more complex. The query now is a term, a term can be multiple WORDs or a quoted term consisting of multiple WORDs surrounded by QUOTEs. In Java we have to check from the termContext that is obtained from the queryContext if the term contains WORDs or a quotedTerm. That is what is shown in the next code block.

Step2SearchDslParser.TermContext termContext = queryContext.term();
handleTermOrQuotedTerm(termContext);
 
private void handleTermOrQuotedTerm(Step2SearchDslParser.TermContext termContext) {
    if (null != termContext.quotedTerm()) {
        handleQuotedTerm(termContext.quotedTerm());
    } else {
        handleWordTokens(termContext.WORD());
    }
}
 
private void handleQuotedTerm(Step2SearchDslParser.QuotedTermContext quotedTermContext) {
    System.out.print("QUOTED ");
    handleWordTokens(quotedTermContext.WORD());
}

Notice how we check if the termContext contains a quotedTerm, just by checking if it is null. The output then becomes

Source: apple
WORDS (1): apple,
Source: "apple juice"
QUOTED WORDS (2): apple,juice,

Time to take the next step, this time we make it possible to specify to make it explicit to query for one term or the other.

Step 3

In this step we make it possible to make it optional for a term to match as long as another term matches. Example queries are:

  • apple
  • apple OR juice
  • “apple juice” OR applejuice

The change to the Lexer is just one type OR. The grammar has to change, now the query needs to support a term or an orQuery. The orQuery consists of a term extended with OR and a term, at least once.

1
2
3
4
5
6
lexer
OR      : 'OR' | '||' ;
 
grammar
query   : term | orQuery ;
orQuery : term (OR term)+ ;

The handling in Java is straightforward now, again some null checks and handle methods.

1
2
3
4
5
if (queryContext.orQuery() != null) {
    handleOrContext(queryContext.orQuery());
} else {
    handleTermContext(queryContext.term());
}

The output of the program then becomes:

Source: apple
WORDS (1): apple,
Source: apple OR juice
Or query: 
WORDS (1): apple,
WORDS (1): juice,
Source: "apple juice" OR applejuice
Or query: 
QUOTED WORDS (2): apple,juice,
WORDS (1): applejuice,

In the final step we want to make the OR complete by also adding an AND.

Step 4

In the final step for this blog we are going to introduce AND. With the combination of AND we can make more complicated combinations. What would you make from one AND two OR three OR four AND five. In my DSL I first do the AND, then the OR. So this would become (one AND two) OR three OR (four AND five). So a document would match if it contains one and two, or four and five, or three. The Lexer does change a bit, again we just add a type for AND. The grammar has to introduce some new terms. It is good to have an overview of the complete grammar.

query               : term | orQuery | andQuery ;
 
orQuery             : orExpr (OR orExpr)+ ;
orExpr              : term|andQuery;
 
andQuery            : term (AND term)+ ;
term                : WORD+|quotedTerm;
quotedTerm          : QUOTE WORD+ QUOTE ;

As you can see, we introduced an orExpr, being a term or an andQuery. We changed an orQuery to become an orExpr followed by at least one combination of OR and another orExpr. The query now is a term, an orQuery or an andQuery. Some examples below.

  • apple
  • apple OR juice
  • apple raspberry OR juice
  • apple AND raspberry AND juice OR Cola
  • “apple juice” OR applejuice

The java code becomes a bit boring by now, so let us move to the output of the program immediately.

Source: apple
WORDS (1): apple,
Source: apple OR juice
Or query: 
WORDS (1): apple,
WORDS (1): juice,
Source: apple raspberry OR juice
Or query: 
WORDS (2): apple,raspberry,
WORDS (1): juice,
Source: apple AND raspberry AND juice OR Cola
Or query: 
And Query: 
WORDS (1): apple,
WORDS (1): raspberry,
WORDS (1): juice,
WORDS (1): Cola,
Source: "apple juice" OR applejuice
Or query: 
QUOTED WORDS (2): apple,juice,
WORDS (1): applejuice,

Concluding

That is it for now, of course this is not the most complicated search DSL. You can most likely come up with other interesting constructs. The goal for this blogpost was to get you underway. In the next blog post I intend to discuss and show how to create a visitor that makes a real elasticsearch query based on the DSL.