Updatable synonyms in Elasticsearch @Bol.com


With around 6.2 million active customers, Bol.com is the number one online retailer in The Netherlands. One of the most important components of an e-commerce platform is a good search engine. Luminis Amsterdam is helping Bol.com implement Elasticsearch for its search capabilities.

Synonyms in Elasticsearch

A common feature of search engines is the ability to search using synonyms. In the UK people use the word bag, whereas people in the US use the word purse for the same thing. What if our documents are UK-oriented and don’t contain the word purse? We would still like to help our US customers, by adding a synonym that either replaces the word purse with bag or adds bag to the query as an extra search term. Whether you want to use synonym expansion or synonym contraction depends on your use-case and won’t be part of this post.
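To make the difference between the two strategies concrete, here is a minimal Python sketch. It is not how Elasticsearch implements synonyms; the function names and the choice of the first term as the canonical one are assumptions made for illustration.

```python
# Hypothetical synonym group; the first entry is treated as the canonical term.
SYNONYM_GROUP = ["bag", "purse", "handbag"]


def expand(tokens, group=SYNONYM_GROUP):
    """Expansion: a token in the group is replaced by every term in the group."""
    result = []
    for token in tokens:
        if token in group:
            result.append(list(group))
        else:
            result.append([token])
    return result


def contract(tokens, group=SYNONYM_GROUP):
    """Contraction: a token in the group is replaced by the canonical term."""
    canonical = group[0]
    return [canonical if token in group else token for token in tokens]


query = ["red", "purse"]
print(expand(query))    # [['red'], ['bag', 'purse', 'handbag']]
print(contract(query))  # ['red', 'bag']
```

With expansion the query matches documents containing any of the three terms; with contraction both documents and queries have to be reduced to the same canonical term.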

Synonyms in Elasticsearch can be configured in two ways: either by adding your synonyms directly to a SynonymTokenFilter, like so:

PUT luminis-synonyms
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "bag, purse, handbag"
            ]
          }
        }
      }
    }
  }
}

Or by placing a file on your server and referencing that file with the synonyms_path property:

PUT luminis-synonyms
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt"
          }
        }
      }
    }
  }
}
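Both variants use the Solr-style synonym format: a comma-separated line such as "bag, purse, handbag" declares equivalent terms, and a line with "=>" declares an explicit mapping. The Python sketch below is a simplified reading of that format, not the parser Elasticsearch uses; the function name is hypothetical and multi-word synonyms and analyzer-specific tokenization are ignored.

```python
def parse_synonym_rules(lines):
    """Parse Solr-style synonym lines into a {term: [replacements]} mapping."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: every term on the left is replaced by the right side.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent synonyms: every term maps to the whole group.
            sources = [t.strip() for t in line.split(",") if t.strip()]
            targets = sources
        for source in sources:
            mapping.setdefault(source, [])
            for target in targets:
                if target not in mapping[source]:
                    mapping[source].append(target)
    return mapping


rules = parse_synonym_rules([
    "# example synonym.txt content",
    "bag, purse, handbag",
    "tv => television",
])
print(rules["purse"])  # ['bag', 'purse', 'handbag']
print(rules["tv"])     # ['television']
```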

Pretty easy to set up, and it works within minutes. There is one caveat, though: what if we want to update the synonyms after the analyzer has been created? Bol.com wants to have full control of their synonyms and the ability to update them at any given time. The official documentation says the following about updating synonyms:

If you specify synonyms inline with the synonyms parameter, your only option is to close the index and update the analyzer configuration with the update index settings API, then reopen the index.

Updating synonyms is easier if you specify them in a file with the synonyms_path parameter. You can just update the file (on every node in the cluster) and then force the analyzers to be re-created by either of these actions:

  • Closing and reopening the index (see open/close index), or
  • Restarting each node in the cluster, one by one

Of course, updating the synonym list will not change any documents that have already been indexed. It will apply only to searches and to new or updated documents. To apply the changes to existing documents, you will need to reindex your data.

What if we want to add more synonyms on our production environment? Index-time synonyms are not an option for us, since changing them requires reindexing the data. With query-time synonyms alone, however, we would still have to restart our production nodes or close and reopen the index for every change. Neither is feasible in a production environment.

Elasticsearch has support for creating your own plugins, effectively extending the functionality of the engine. In order to meet Bol.com’s requirement to update synonyms on the fly, we created a plugin that does just that. Since we’re using query-time synonyms, our plugin extends the functionality of the SynonymTokenFilter by giving it an updatable synonymMap. We added custom REST handlers that process new synonyms after they’ve been uploaded to Elasticsearch. Now the people responsible for maintaining synonyms at Bol.com can add and remove synonyms on the fly, whereas with their current Endeca setup it takes one day for new synonyms to be processed.
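The idea of an updatable synonym map can be sketched in a few lines of Python. This is not the plugin's code (the plugin itself extends Elasticsearch's token filter); the class and method names are hypothetical. The point it illustrates is that an update builds a complete new map and swaps it in as a whole, so a query running at the same time sees either the old synonyms or the new ones.

```python
import threading


class UpdatableSynonyms:
    """Holds a synonym snapshot that is never mutated in place, only replaced."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._snapshot = dict(initial or {})

    def replace(self, new_mapping):
        # Build the new snapshot first, then swap the reference in one step,
        # so a concurrent reader sees either the old or the new map, never a mix.
        fresh = {term: tuple(values) for term, values in new_mapping.items()}
        with self._lock:
            self._snapshot = fresh

    def lookup(self, token):
        snapshot = self._snapshot  # take one reference for the whole lookup
        return list(snapshot.get(token, (token,)))


synonyms = UpdatableSynonyms({"purse": ("bag", "purse")})
print(synonyms.lookup("purse"))  # ['bag', 'purse']
synonyms.replace({"purse": ["bag", "purse", "handbag"]})
print(synonyms.lookup("purse"))  # ['bag', 'purse', 'handbag']
print(synonyms.lookup("shoe"))   # ['shoe']
```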

How the plugin works

As mentioned above, I created a new _synonym endpoint in Elasticsearch, which offers the following functionality:

  • GET: Retrieve the current list of synonyms
  • POST/PUT: Upload a new synonym file that replaces the current synonyms
  • DELETE: Delete all known synonyms

Since we’re working in a distributed environment, we have to let the other nodes in the cluster know that a new file has been uploaded. As soon as the node receiving the file has processed the new synonyms, it stores them in an index in Elasticsearch. After indexing the new synonyms, the plugin broadcasts to the other nodes in the cluster that new synonyms are available. Each of those nodes then applies the new file and confirms to the original node whether everything went OK. Once all nodes have processed the new synonyms, we report back to the user that applying them was successful. If an error happens during processing, we report back to the user that the operation was unsuccessful. At the time of writing, there is no automatic failover or rollback mechanism yet, which means that we have to manually check what happened and re-apply the file.
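The flow described above (store, broadcast, collect confirmations, report) can be simulated in plain Python. Everything here is a stand-in: the Node class, the dictionary playing the role of the synonyms index, and the shape of the response are assumptions for illustration, not the plugin's actual transport code.

```python
class Node:
    """Hypothetical cluster node that keeps synonyms in memory."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.synonyms = []

    def apply(self, synonyms):
        if not self.healthy:
            raise RuntimeError(f"{self.name} could not apply synonyms")
        self.synonyms = list(synonyms)


def upload_synonyms(synonyms, synonym_index, nodes):
    """Store the synonyms, broadcast them, and report whether every node confirmed."""
    synonym_index["current"] = list(synonyms)  # stand-in for indexing into Elasticsearch
    failed = []
    for node in nodes:
        try:
            node.apply(synonym_index["current"])
        except RuntimeError:
            failed.append(node.name)  # no rollback: the failure is only reported
    return {"acknowledged": not failed, "failed_nodes": failed}


index = {}
cluster = [Node("node-1"), Node("node-2"), Node("node-3", healthy=False)]
print(upload_synonyms(["bag, purse, handbag"], index, cluster))
# {'acknowledged': False, 'failed_nodes': ['node-3']}
```

In this simulated failure case the healthy nodes already hold the new synonyms while node-3 does not, which is exactly the situation that currently needs a manual check and a re-apply.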

Since synonyms are kept in memory, we need a way for new nodes joining the cluster, or existing nodes restarting, to load the synonyms during startup. We created a service that checks for an existing synonyms index and, if one exists, loads its contents into memory. This way we make sure that all nodes in the cluster have the same version of the synonyms file in memory.
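A minimal sketch of that startup check, again in Python with a dictionary standing in for the synonyms index (the function name and the "current" key are hypothetical):

```python
def load_synonyms_on_startup(synonym_index):
    """Return the stored synonyms, or an empty list if nothing was uploaded yet."""
    return list(synonym_index.get("current", []))


# Hypothetical stand-in for the synonyms index shared by the cluster.
synonym_index = {"current": ["bag, purse, handbag"]}

existing_node_synonyms = list(synonym_index["current"])
new_node_synonyms = load_synonyms_on_startup(synonym_index)

print(new_node_synonyms == existing_node_synonyms)  # True
print(load_synonyms_on_startup({}))                 # []
```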

This is just one of the cool new things we’re doing at Bol.com.