Migrating from GSA to Elasticsearch with ANTLR

Analytics ElasticSearch Search

Starting January 1st of 2019, the Google Search Appliance (GSA) is set to be EOL. For one of my clients, we have chosen Elasticsearch as our alternative from 2019 onwards. However, there was one problem: the current GSA solution is in use by several APIs that can’t be changed for various reasons. Therefore, it was up to us to replicate the GSA’s behaviour with the new Elasticsearch implementation and, come migration time, to swap out the GSA for Elasticsearch with as few functional changes as possible.

The GSA provides a range of functionality, some of which is easily implemented with other technologies. In our case, this included search functionality, of course, but also other things such as website crawling. The part that I found most interesting, however, was the GSA’s ability to let users form queries in two small domain-specific languages (DSLs).

Queries in these two DSLs reach the GSA as query parameters on the GSA URL. The first DSL, specified with query parameter q, has three “modes” of functionality:

  • Free text search, by simply putting some space separated terms
  • allintext search, a search query much like the free text search, but excluding fields such as metadata, anchors and URL’s from the search
  • inmeta search, which can potentially do a lot, but in our case was merely restricted to searches on metadata of the form key=value.

The second DSL, specified with query parameter partialFields, also provides searching on metadata. In this case, searches are of the form (key:value) and may be combined with three boolean operators:

  • .: AND
  • |: OR
  • -: NOT

An example query could then be (key1:value1)|(key2:value2).
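
Since both DSLs arrive as ordinary URL query parameters, the first step in any replacement is pulling q and partialFields out of the request. The sketch below does that with the JDK only. The class and method names are made up for this illustration, and in a real application the web framework would normally do this for you.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the repository.
final class GsaParamsSketch {

    // Splits a raw query string such as "q=...&partialFields=..." into decoded parameters.
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String part : rawQuery.split("&")) {
            if (part.isEmpty()) {
                continue;
            }
            // Split on the first '=' only, so a raw "key=value" inside q survives.
            int eq = part.indexOf('=');
            String name = eq < 0 ? part : part.substring(0, eq);
            String value = eq < 0 ? "" : part.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }
}
```

Each of the two decoded values is then handed to its own parser, which is what the rest of this post is about.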

In this blog, I will explain how to implement these two DSLs using ANTLR and I will show you how ANTLR enables us to separate the parsing of the DSL from our other application logic.

If this is your first time working with ANTLR, you may want to read two posts ([1][2]) that have been posted on our blog earlier.

If you are looking for the complete implementation, then please refer to the GitHub repository.

Parsing the GSA DSL

Let us start with parsing the q DSL. I have split the ANTLR grammar into a separate parser and lexer file for readability.

The parser is as follows:

    parser grammar GsaQueryParser;
    
    options { tokenVocab=GsaQueryLexer; }
    
    query   : pair (OR? pair)*  #pairQuery
            | TEXT+             #freeTextQuery;
    
    pair    : IN_META TEXT+                 #inmetaPair
            | ALL_IN_TEXT TEXT (OR? TEXT)*  #allintextPair;
    

And the lexer is defined below:

lexer grammar GsaQueryLexer;

ALL_IN_TEXT : 'allintext:';
IN_META     : 'inmeta:';

OR          : 'OR';

TEXT        : ~(' '|'='|':'|'|'|'('|')')+;

WHITESPACE  : [ \t\r\n]+ -> skip;
IGNORED     : [=:|()]+ -> skip;
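
To see what these rules do to an actual query, here is a small plain-Java approximation of the lexer. It is not the ANTLR-generated code; it only mimics the two rules ANTLR applies when several lexer rules match at the same position: the longest match wins, and on a tie the rule defined first wins. QLexerSketch and tokenize are names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of GsaQueryLexer, for illustration only.
final class QLexerSketch {

    // Same order as in the grammar; the last two rules are skipped.
    private static final String[] NAMES =
        {"ALL_IN_TEXT", "IN_META", "OR", "TEXT", "WHITESPACE", "IGNORED"};
    private static final Pattern[] RULES = {
        Pattern.compile("allintext:"),
        Pattern.compile("inmeta:"),
        Pattern.compile("OR"),
        Pattern.compile("[^ =:|()]+"),
        Pattern.compile("[ \\t\\r\\n]+"),
        Pattern.compile("[=:|()]+")
    };
    private static final int FIRST_SKIPPED_RULE = 4;

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            int bestEnd = pos;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // Strictly longer matches replace the current best, so ties go to the earlier rule.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = i;
                    bestEnd = m.end();
                }
            }
            if (best < 0) {
                throw new IllegalArgumentException("No rule matches at position " + pos);
            }
            if (best < FIRST_SKIPPED_RULE) {
                tokens.add(NAMES[best] + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }
}
```

For the input inmeta:year=1972 OR allintext:Elton this yields IN_META, TEXT(year), TEXT(1972), OR, ALL_IN_TEXT and TEXT(Elton). The = and the spaces are skipped, which is why the parser only ever sees a run of TEXT tokens after a keyword.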

Note that the parser grammar reflects the two different ways that the q DSL can be used: by specifying pairs or by simply entering a free text query. The pairs can be separated by an OR operator. Furthermore, the terms of an allintext query may be separated by OR as well.

The definition of the partialFields DSL is somewhat different, because it allows for query nesting and more boolean operators. Both the parser and the lexer are shown below, again in two separate files.

Parser:

parser grammar GsaPartialFieldsParser;

options { tokenVocab=GsaPartialFieldsLexer; }

query       : pair
            | subQuery;

subQuery    : LEFTBRACKET subQuery RIGHTBRACKET
            | pair (AND pair)+
            | pair (OR pair)+
            | subQuery (AND subQuery)+
            | subQuery (OR subQuery)+
            | subQuery AND pair
            | subQuery OR pair
            | pair AND subQuery
            | pair OR subQuery;

pair        : LEFTBRACKET KEYWORD VALUE RIGHTBRACKET        #inclusionPair
            | LEFTBRACKET NOT KEYWORD VALUE RIGHTBRACKET    #exclusionPair
            | LEFTBRACKET pair RIGHTBRACKET                 #nestedPair;

Lexer:

lexer grammar GsaPartialFieldsLexer;

AND         : '.';
OR          : '|';
NOT         : '-';

// Explicit ranges: [A-z] would also match the characters [ \ ] ^ and `
KEYWORD     : [A-Za-z0-9_]([A-Za-z0-9_]|'-'|'.')*;
VALUE       : SEPARATOR~(')')+;

SEPARATOR   : [:];
LEFTBRACKET : [(];
RIGHTBRACKET: [)];
WHITESPACE  : [\t\r\n]+ -> skip;

Note the usage of labels in both grammars which, in the above case, allows me to easily distinguish different types of key-value pairs; nested, inclusive or exclusive. Furthermore, there is a gotcha in the matching of the VALUE token. To make a clear distinction between KEYWORD and VALUE tokens, I’ve included the : as part of a VALUE token.
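
The reason this works is that the lexer always prefers the longest match: a VALUE defined as just ~(')')+ would swallow artist:Elton whole, whereas a VALUE that must start with : cannot begin at the a. The regular expression below approximates the inclusionPair and exclusionPair shapes to show which part of the input ends up where. It is an illustration only (PairSketch is a hypothetical name) and not a replacement for the grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex approximation of a single (key:value) pair.
final class PairSketch {

    static final class Pair {
        final boolean excluded;
        final String key;
        final String value;

        Pair(boolean excluded, String key, String value) {
            this.excluded = excluded;
            this.key = key;
            this.value = value;
        }
    }

    // Group 1: optional NOT, group 2: KEYWORD, group 3: VALUE including its leading ':'.
    private static final Pattern PAIR =
        Pattern.compile("\\((-?)([A-Za-z0-9][A-Za-z0-9.\\-]*)(:[^)]+)\\)");

    static Pair parse(String input) {
        Matcher m = PAIR.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a (key:value) pair: " + input);
        }
        // The colon belongs to the VALUE token, so it has to be stripped afterwards.
        String value = m.group(3).substring(1);
        return new Pair(!m.group(1).isEmpty(), m.group(2), value);
    }
}
```

Parsing (artist:Elton John) gives the key artist and the value Elton John; parsing (-artist:Elton) gives the same kind of pair with the exclusion flag set.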

Creating the Elasticsearch query

Now that we have our grammars ready, it’s time to use the parse tree generated by ANTLR to construct corresponding Elasticsearch queries. I will post some source code snippets, but make sure to refer to the complete implementation for all details.

For both DSLs, I have chosen to walk the tree using the visitor pattern. We will start by reviewing the q DSL.

Creating queries from the q DSL

The visitor of the q DSL extends a BaseVisitor generated by ANTLR and will eventually return a QueryBuilder, as indicated by the generic type:

public class QueryVisitor extends GsaQueryParserBaseVisitor<QueryBuilder>

There are three cases that we can distinguish for this DSL: a free text query, an allintext query or an inmeta query. Implementing the free text and allintext query means extracting the TEXT token from the tree and then constructing a MultiMatchQueryBuilder, e.g.:

@Override
public QueryBuilder visitFreeTextQuery(GsaQueryParser.FreeTextQueryContext ctx) {
    String text = concatenateValues(ctx.TEXT());
    return new MultiMatchQueryBuilder(text, "album", "artist", "id", "information", "label", "year");
}

private String concatenateValues(List<TerminalNode> textNodes) {
    return textNodes.stream().map(ParseTree::getText).collect(joining(" "));
}

The fields that you use in this match query depend on the data that is in Elasticsearch – in my case some documents describing music albums. An inmeta query requires us to extract both the field and the value, which we then use to construct a MatchQueryBuilder, e.g.:

@Override
public QueryBuilder visitInmetaPair(GsaQueryParser.InmetaPairContext ctx) {
    List<TerminalNode> textNodes = ctx.TEXT();

    // The first TEXT token is the metadata key; the remaining tokens form the value.
    String key = textNodes.get(0).getText().toLowerCase();
    String value = concatenateValues(textNodes.subList(1, textNodes.size()));

    return new MatchQueryBuilder(key, value);
}
We can then combine multiple pairs by implementing the visitPairQuery method:

@Override
public QueryBuilder visitPairQuery(GsaQueryParser.PairQueryContext ctx) {
    BoolQueryBuilder result = new BoolQueryBuilder();
    ctx.pair().forEach(pair -> {
        QueryBuilder builder = visit(pair);
        if (hasOrClause(ctx, pair)) {
            result.should(builder);
            result.minimumShouldMatch(1);
        } else {
            result.must(builder);
        }
    });
    return result;
}

Based on the presence of OR clauses we either create a should or must boolean clause for our Elasticsearch query.
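
The hasOrClause helper is not shown above. A minimal sketch of the idea, assuming a pair counts as OR'ed when an OR token sits directly before or after it, could look as follows. To keep it runnable without ANTLR, the children of the parse tree node are modelled as plain strings; the helper in the repository may well be implemented differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the hasOrClause helper; children are plain strings here.
final class OrClauseSketch {

    private static final String OR = "OR";

    // Assumption: a pair is OR'ed if one of its direct neighbours is an OR token.
    static boolean hasOrClause(List<String> children, int index) {
        boolean orBefore = index > 0 && OR.equals(children.get(index - 1));
        boolean orAfter = index + 1 < children.size() && OR.equals(children.get(index + 1));
        return orBefore || orAfter;
    }

    // Sorts every pair into the "should" or "must" bucket, mirroring visitPairQuery.
    static Map<String, List<String>> classify(List<String> children) {
        Map<String, List<String>> clauses = new LinkedHashMap<>();
        clauses.put("should", new ArrayList<>());
        clauses.put("must", new ArrayList<>());
        for (int i = 0; i < children.size(); i++) {
            if (OR.equals(children.get(i))) {
                continue;
            }
            String bucket = hasOrClause(children, i) ? "should" : "must";
            clauses.get(bucket).add(children.get(i));
        }
        return clauses;
    }
}
```

Under this assumption, the children [pairA, OR, pairB, pairC] put pairA and pairB in the should bucket and pairC in the must bucket.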

Creating queries from the partialFields DSL

The visitor of the partialFields DSL also extends a BaseVisitor generated by ANTLR and also returns a QueryBuilder:

public class PartialFieldsVisitor extends GsaPartialFieldsParserBaseVisitor<QueryBuilder>

There are three kinds of pairs that we can specify with this DSL (inclusion, exclusion or nested pair) and we can override a separate method for each option, because we labelled these alternatives in our grammar. A nested pair is simply unwrapped and then passed back to ANTLR for further processing:

@Override
public QueryBuilder visitNestedPair(GsaPartialFieldsParser.NestedPairContext ctx) {
    return visit(ctx.pair());
} 

The inclusion and exclusion query implementations are quite similar to each other:

@Override
public QueryBuilder visitInclusionPair(GsaPartialFieldsParser.InclusionPairContext ctx) {
    return createQuery(ctx.KEYWORD().getText(), ctx.VALUE().getText(), false);
}

@Override
public QueryBuilder visitExclusionPair(GsaPartialFieldsParser.ExclusionPairContext ctx) {
    return createQuery(ctx.KEYWORD().getText(), ctx.VALUE().getText(), true);
}

private QueryBuilder createQuery(String key, String value, boolean isExcluded) {       
    value = value.substring(1);

    if (isExcluded) {
        return new BoolQueryBuilder().mustNot(new MatchQueryBuilder(key, value));
    } else {
        return new MatchQueryBuilder(key, value).operator(Operator.AND);
    }
}

Remember that we included the : to help our token recognition? The code above is where we need to handle this, by taking the substring of the value.

What remains is to implement a way to handle the combinations of pairs and boolean operators. This is done by implementing the visitSubQuery method, and you can view the implementation here. Based on the presence of an AND or OR operator, we apply must or should clauses, respectively.
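
The mapping from operator to clause type can be sketched without any Elasticsearch dependency by building the query body as nested maps. This is only an illustration of the idea under that assumption; the actual visitSubQuery works on QueryBuilder instances, and BoolClauseSketch, combine and match are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: query bodies as nested maps instead of QueryBuilders.
final class BoolClauseSketch {

    // '.' (AND) becomes a must clause, '|' (OR) becomes a should clause.
    static Map<String, Object> combine(String operator, List<Map<String, Object>> clauses) {
        Map<String, Object> bool = new LinkedHashMap<>();
        switch (operator) {
            case ".":
                bool.put("must", clauses);
                break;
            case "|":
                bool.put("should", clauses);
                break;
            default:
                throw new IllegalArgumentException("Unknown operator: " + operator);
        }
        return Map.of("bool", bool);
    }

    // Simplified match clause of the form {"match": {field: value}}.
    static Map<String, Object> match(String field, String value) {
        return Map.of("match", Map.of(field, value));
    }
}
```

Combining match("artist", "Elton") and match("label", "MCA") with "." produces a bool query with two must clauses, which has the same shape as the partialFields part of the last example further down.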

Examples

In my repository, I’ve included a REST controller that can be used to execute queries using the two DSLs. Execute the following steps to follow along with the examples below:

  • Start an Elasticsearch instance at http://localhost:9200 (the application assumes v6.4.2)
  • Clone the repository: git clone https://github.com/markkrijgsman/migrate-gsa-to-elasticsearch.git && cd migrate-gsa-to-elasticsearch
  • Compile the repository: mvn clean install
  • Run the application: cd target && java -jar search-api.jar
  • Fill the Elasticsearch instance with some documents: http://localhost:8080/load
  • Start searching: http://localhost:8080/search

You can also use the Swagger UI to execute some requests: http://localhost:8080/swagger-ui.html. For each example I will list the URL for the request and the resulting Elasticsearch query that is constructed by the application.

Get all albums mentioning Elton John
http://localhost:8080/search?q=Elton

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Elton",
            "fields": [
              "album",
              "artist",
              "id",
              "information",
              "label",
              "year"
            ],
            "type": "best_fields",
            "operator": "AND",
            "lenient": true
          }
        }
      ]
    }
  }
}

Get all albums where Elton John or Frank Sinatra are mentioned
http://localhost:8080/search?q=allintext:Elton%20OR%20Sinatra

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "multi_match": {
                  "query": "Elton Sinatra",
                  "fields": [
                    "album",
                    "artist",
                    "id",
                    "information",
                    "label",
                    "year"
                  ],
                  "type": "best_fields",
                  "operator": "OR",
                  "lenient": true
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Note that the operator for the multi match query is now OR, where it was AND in the previous example.
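
The allintext visitor itself is not listed in this post. One way to arrive at this behaviour, assuming the operator is chosen purely on whether an OR token appears between the terms, is sketched below with plain strings instead of ANTLR tokens. AllintextSketch is a hypothetical name and the repository may do this differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of how an allintext pair could pick its operator.
final class AllintextSketch {

    // tokens: the texts that follow "allintext:", for example ["Elton", "OR", "Sinatra"].
    static Map<String, String> toMultiMatch(List<String> tokens) {
        boolean hasOr = tokens.contains("OR");
        // The OR keyword only steers the operator; it is not part of the search text.
        String query = tokens.stream()
            .filter(token -> !token.equals("OR"))
            .collect(Collectors.joining(" "));

        Map<String, String> multiMatch = new LinkedHashMap<>();
        multiMatch.put("query", query);
        multiMatch.put("operator", hasOr ? "OR" : "AND");
        return multiMatch;
    }
}
```

With the tokens Elton, OR and Sinatra this gives the query text Elton Sinatra with operator OR; without the OR token the operator stays AND.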

Get all albums where the artist is Elton John
http://localhost:8080/search?partialFields=(artist:Elton)

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "artist": {
              "query": "Elton",
              "operator": "AND"
            }
          }
        }
      ]
    }
  }
}

Get all albums where Elton John is mentioned, but is not the artist
http://localhost:8080/search?partialFields=(-artist:Elton)&q=Elton

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must_not": [
              {
                "match": {
                  "artist": {
                    "query": "Elton",
                    "operator": "OR"
                  }
                }
              }
            ]
          }
        },
        {
          "multi_match": {
            "query": "Elton",
            "fields": [
              "album",
              "artist",
              "id",
              "information",
              "label",
              "year"
            ],
            "type": "best_fields",
            "operator": "AND",
            "lenient": true
          }
        }
      ]
    }
  }
}

Get all albums created by Elton John between 1972 and 1974 for the label MCA
http://localhost:8080/search?partialFields=(artist:Elton).(label:MCA)&q=inmeta:year:1972..1974

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "match": {
                  "artist": {
                    "query": "Elton",
                    "operator": "AND"
                  }
                }
              },
              {
                "match": {
                  "label": {
                    "query": "MCA",
                    "operator": "AND"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "range": {
                  "year": {
                    "from": "1972",
                    "to": "1974",
                    "include_lower": true,
                    "include_upper": true
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
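
The visitInmetaPair snippet shown earlier only builds a match query, so the range in this example implies some extra handling of the .. separator. A minimal sketch of how such a value might be split is given below. It uses nested maps and the gte/lte notation of the Elasticsearch query DSL rather than the from/to form printed above, and RangeValueSketch is a hypothetical name, not the repository's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turn an inmeta value into either a range or a match clause.
final class RangeValueSketch {

    private static final String RANGE_SEPARATOR = "..";

    static Map<String, Object> toQuery(String field, String value) {
        int separator = value.indexOf(RANGE_SEPARATOR);
        if (separator < 0) {
            // No range separator: a plain match on the field.
            return Map.of("match", Map.of(field, value));
        }
        // Both bounds are inclusive, like include_lower and include_upper above.
        Map<String, Object> bounds = new LinkedHashMap<>();
        bounds.put("gte", value.substring(0, separator));
        bounds.put("lte", value.substring(separator + RANGE_SEPARATOR.length()));
        return Map.of("range", Map.of(field, bounds));
    }
}
```

For the field year and the value 1972..1974 this produces a range clause with 1972 as lower and 1974 as upper bound; a value without .. falls back to a match clause.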

Please refer to the unit tests for many more examples:

  • Unit test examples for DSL q
  • Unit test examples for DSL partialFields

Conclusions

As you can see, the usage of ANTLR allows us to specify fairly complex DSLs without compromising readability. We’ve cleanly separated the parsing of a user query from the actual construction of the resulting Elasticsearch query. All code is easily testable and makes little to no use of hard-to-understand regular expressions.

A good addition would be to add some integration tests to your implementation, which you can learn more about here. If you have any questions or comments, let me know!
