Being relevant with Solr


Ok, I have to admit, I am not (yet) a Solr expert. However, I have been working with Elasticsearch for years, and the fundamentals of obtaining relevant results are the same for Solr and Elasticsearch. For a customer, I am working on fine-tuning their search results on an outdated Solr version. They are using Solr 4.8.1; yes, an upgrade is planned for the future. Still, they want to improve their search results now. Using my search knowledge I started getting into Solr, and I liked what I saw: a query with matching algorithms, filters to limit the documents that need to be considered, and on top of that lots of boosting options. So many boosting options that I had to experiment a lot to get to the right results.

In this blog post, I am going to explain what I did with Solr, coming from an Elasticsearch background. I do not intend to create a complete guide on how to use Solr. I’ll focus on the bits that surprised me and on the stuff that deals with tuning the edismax type of query.

A little bit of context

Imagine you are running a successful eCommerce website. Or even better, you have created a superb online shopping experience. With an excellent marketing strategy, you are generating lots of traffic to your website. But, yes there is a but, sales are not as expected. Maybe they are even a bit disappointing. You start evaluating the visits to your site using analytics. When going through your search analytics you notice that most of the executed searches do not result in a click, and therefore not in sales. It looks like the results shown to your visitors for their searches are far from optimal. So we need better search results.

Getting to know your data

Before we start writing queries, we need to have an idea about our data. We need to create inverted indexes per field, or for combinations of fields, using different analyzers. We need to be able to search for parts of sentences, but also be able to boost matching sentences. We want to be able to find matches for combinations of fields. Imagine you sell books, and you want people to be able to look for the latest book by Dan Brown, called Origin. Users might enter a search query like Dan Brown Origin. It might become a challenge if you have structured data like:


{
    "author": "Dan Brown",
    "title": "Origin"
}

How would you do it if people want to have the latest Dan Brown? What if you want to help people choose by using the popularity of books, based on ratings or sales? Or how to act if people want to look at all the books in the Harry Potter series? Of course, we need to have the right data to be able to facilitate our visitors with these new requirements. We also need a media_type field later on. With the media type, we can filter on all eBooks, for example. So the data becomes something like the following block.

{
    "id": "9781400079162",
    "author": "Dan Brown",
    "title": "Origin",
    "release_date": "2017-10-08T00:00:00Z",
    "rating": 4.5,
    "sales_pastyear": 239,
    "media_type": "ebook"
}

Ranking requirements

Based on analysis and domain knowledge we have the following thoughts translated into requirements for the ranking of search results:

  • Recent books are more important than older books
  • Books with a higher rating are more important than lower-rated books
  • Unrated books are more important than low-rated books
  • Books that are sold more often in the past year are more important than unsold books
  • Normal text matching rules should be applied

Mapping data to our index

In Solr, you create a schema.xml to map the expected data to specific types. You can also use the copyField functionality to create new fields that are analyzed differently or that combine the provided fields. An example would be a field that contains all other searchable fields. In our case, we could create a field containing both the author and the title. This field is analyzed in the most optimal way for matching: we add a tokenizer, but also filters for lowercasing, stop words, diacritics, and compound words. We also have fields that are meant for boosting, using phrases and numbers or dates. We want fields like title and author to support phrase matches but also full matches. With this, we get a few extra search requirements:
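A minimal sketch of what this could look like in schema.xml. The field names, field types, and the exact analyzer chain here are my assumptions for illustration, not the actual schema:

```xml
<!-- Hypothetical schema.xml fragment: names and analyzer chain are assumptions -->
<field name="author" type="text_search" indexed="true" stored="true"/>
<field name="title" type="text_search" indexed="true" stored="true"/>
<field name="combined_author_title" type="text_search" indexed="true" stored="false" multiValued="true"/>

<!-- Copy author and title into one searchable field -->
<copyField source="author" dest="combined_author_title"/>
<copyField source="title" dest="combined_author_title"/>

<!-- Analyzer tuned for matching: tokenize, lowercase, stop words, diacritics -->
<fieldType name="text_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

The author_full and title_full fields for exact matching would use a different field type, for instance one based on solr.KeywordTokenizerFactory plus lowercasing.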

  • Documents whose exact author or title matches the query should be more important
  • Documents whose title contains the words of the query in the same order are more important

With these rules, we can start to create a query and apply our matching and boosting requirements.

The Query

Creating the query was my biggest surprise when moving to Solr. Another configuration mechanism is solrconfig.xml. This file configures the Solr node. It gives you the option to create your own endpoint for a query that comes with lots of defaults. One thing we can do, for instance, is create an endpoint that automatically filters for eBooks. We can call this endpoint to search for eBooks only. Below you’ll find a sample of the config that does just this.


    <!-- Endpoint name and parameter names reconstructed; values are from the original -->
    <requestHandler name="/ebooks" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="wt">json</str>
        <str name="fq">media_type:ebook</str>
        <str name="df">combined_author_title</str>
      </lst>
    </requestHandler>

For our own query, we’ll need other options that Solr provides, in the form of the edismax query parser. This comes with options to boost your results using phrases, but also using ratings, release dates, etc. Below is an image giving you an idea of what the query should do. Next, I’ll show you how this translates into the Solr configuration.

[Image: diagram of the matching and boosting steps in the query]


    <!-- Handler name and parameter names reconstructed; values are from the original.
         ps2/ps3 are my best guess for the two slop values. -->
    <requestHandler name="/booksearch" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="wt">json</str>
        <int name="ps2">2</int>
        <int name="ps3">3</int>
        <str name="pf">author^4 title^2</str>
        <str name="pf2">author^4 title^2</str>
        <str name="pf3">author^4 title^2</str>
        <str name="qf">author_full^5 title_full^5</str>
        <str name="boost">product(div(def(rating,4),4),recip(ms(NOW/DAY,release_date),3.16e-11,1,1),log(product(5,sales_pastyear)))</str>
        <str name="df">combined_author_title</str>
        <str name="defType">edismax</str>
        <bool name="lowercaseOperators">false</bool>
      </lst>
    </requestHandler>

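To get a feeling for what the boost function does, here is a small Python sketch of the same formula — my own re-implementation for illustration, not Solr code. The def(rating,4) part falls back to 4 for unrated books, recip(...,3.16e-11,1,1) roughly halves the recency boost per year of age (3.16e-11 is about 1 divided by the number of milliseconds in a year), and Solr’s log() function is base 10.

```python
import math

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365  # ~3.15e10 ms, so 3.16e-11 ~ 1/year

def boost(rating, age_ms, sales_pastyear):
    # div(def(rating,4),4): unrated books count as a 4.0 rating, normalized to ~1
    rating_part = (rating if rating is not None else 4.0) / 4.0
    # recip(ms(NOW/DAY,release_date),3.16e-11,1,1) = 1 / (3.16e-11 * age + 1)
    recency_part = 1.0 / (3.16e-11 * age_ms + 1.0)
    # log(product(5,sales_pastyear)); note that Solr's log() is base 10,
    # and that zero sales would make this blow up (log of 0)
    sales_part = math.log10(5 * sales_pastyear)
    return rating_part * recency_part * sales_part

# A brand-new 4-star book with 2 sales scores exactly 1.0:
# (4/4) * 1/(0+1) * log10(10) = 1.0
print(boost(4.0, 0, 2))            # → 1.0
# A one-year-old book keeps roughly half of its recency boost
print(boost(4.0, MS_PER_YEAR, 2))
```

Because the boost parameter is a multiplicative factor on the relevance score, an unrated fresh book with a few sales keeps its full text score, while older and low-rated books are pushed down, which matches the ranking requirements above.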

I am not going over all the different parameters. For multi-term queries we use phrases; these are configured with pf, pf2, and pf3. Also, mm is used for multi-term queries. This has to do with the number of terms that have to match: if you use three terms, they all have to match. The edismax query also supports using AND/OR when you need more control over which terms to match. With lowercaseOperators we prevent a lowercase and/or from also being used to create your own boolean query.

With respect to boosting there is the bq; these numbers are added to the score. With the field boost, we do a multiplication. Look also at the diagram, and notice that the bq has text-related scores, while the boost has numeric scores.

That is about it for now. I think it is good to look at the differences between Solr and Elasticsearch. I like the idea of creating a query with Solr. Of course, you can do the same with Elasticsearch. The JSON API for creating a query is really flexible, but you have to create the constructs used in Solr yourself.
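As an illustration of that last point, a comparable Elasticsearch query could combine a multi_match with a function_score. This is a sketch under the assumption of the same field names, not a tested one-to-one translation of the Solr config:

```json
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "dan brown origin",
          "fields": ["author^4", "title^2", "combined_author_title"]
        }
      },
      "functions": [
        { "field_value_factor": { "field": "rating", "factor": 0.25, "missing": 4 } },
        { "gauss": { "release_date": { "origin": "now", "scale": "365d", "decay": 0.5 } } },
        { "field_value_factor": { "field": "sales_pastyear", "modifier": "log1p", "missing": 0 } }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}
```

The point stands: everything is possible, but the recency decay, the rating fallback, and the sales boost each have to be spelled out as separate function clauses, whereas Solr lets you put the whole formula in a single boost parameter in the config.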