Multiword synonyms in search using Querqy
There is one topic that gives even the toughest search engineers a headache: multi-word synonyms. There are ways to get close to a solution, and one of them is a nice tool called Querqy, created by René Kriegler.
The goal of this blog post is to introduce Querqy for Solr and give you an easy way to start experimenting using Docker. As a bonus, I’ll show you how to use another tool called SMUI, a graphical user interface that helps you write the rules.
The past week I talked with a few people who are responsible for adding synonyms to their e-commerce search solution, an integration of an e-commerce platform and Solr 4.8.1. The Solr version brings limitations, and the integration with the e-commerce platform adds lots of constraints. I had to explain the multi-word synonyms problem to them. If you have not heard of this before and would like to learn more about the technical side, check the references. The editors had finally found a tool they could use to influence the results returned to users. But now I was telling them they could not use it: they fixed one product search, but they broke numerous others. When thinking about a solution, I remembered the talk from René Kriegler about Querqy. Querqy is a query re-write tool that accepts lots of rules to specify what to re-write. I could not find a version that I could use for Solr 4.8.1. Still, I wanted to have a good look at Querqy.
Setting up the test environment using Docker
If you want to try this out yourself, check out my GitHub repo (https://github.com/jettro/querqy-tryout). I don’t want this to be an extensive Docker tutorial; I use Docker to make some things more manageable. Querqy comes as a plugin to Solr, which is not hard to set up using Docker. You can download the jar by following the link on the Querqy website. For convenience, I have included the specific version I use in the GitHub repository. First, we need to create our own Docker image with the jar file for Querqy included. Second, we create a docker-compose file with three containers: Solr, SMUI, and the MySQL database that SMUI needs. Below you can see the Dockerfile for the Solr container with Querqy.
FROM solr:slim
COPY ./lib/*.jar /opt/solr/dist
EXPOSE 8983
CMD ["solr-precreate", "gettingstarted"]
COPY ./lib/ /var/solr/data/gettingstarted/conf/solrconfig.xml
Next, you can see the docker-compose.yml containing the three mentioned containers.
version: "3"
services:
  solr:
    container_name: my_solr
    build:
      context: .
      dockerfile: ./Dockerfile
    image: my_solr
    restart: "no"
    volumes:
      - "./solrdata/:/var/solr/"
    ports:
      - "8983:8983"
  db:
    image: mysql:5.7
    restart: "no"
    environment:
      MYSQL_DATABASE: 'smui'
      # So you don't have to use root, but you can if you like
      MYSQL_USER: 'smui'
      # You can use whatever password you like
      MYSQL_PASSWORD: 'smui'
      # Password for root access
      MYSQL_ROOT_PASSWORD: 'password'
    ports:
      # <host port>:<container port>
      - '3306:3306'
    expose:
      # Opens port 3306 on the container
      - '3306'
    # Where our data will be persisted
    volumes:
      - ./mysql-db:/var/lib/mysql
  smui:
    image: pbartusch/smui
    restart: "no"
    environment:
      - "SMUI_DB_URL=jdbc:mysql://db:3306/smui?autoReconnect=true&useSSL=false"
      - "SMUI_2SOLR_SOLR_HOST=solr:8983"
      - "SMUI_2SOLR_SRC_TMP_FILE=/smui/temp/rules.txt"
    ports:
      - "9000:9000"
    expose:
      - '9000'
    volumes:
      - ./smui_path:/smui/temp
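With the Dockerfile and the compose file in place, building the image and starting the three containers could look like the sketch below. It assumes the standalone docker-compose binary and the 8983 port mapping from the compose file. The helper names start_stack and wait_for_solr are made up for this post, and the two-second polling interval is an arbitrary choice.

```shell
# Build the custom Solr image and start solr, db and smui in the background.
start_stack() {
  docker-compose up -d --build
}

# Poll Solr's CoreAdmin STATUS endpoint until it answers or the attempts run out.
# Returns 0 as soon as Solr responds, 1 if it never does.
wait_for_solr() {
  attempts=${1:-30}
  while [ "$attempts" -gt 0 ]; do
    if curl -sf "http://localhost:8983/solr/admin/cores?action=STATUS" >/dev/null; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  return 1
}
```

Calling start_stack followed by wait_for_solr before running the setup commands avoids sending requests to a Solr instance that is still starting.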
I still have some issues with the automatic replacement of rules.txt using the provided SMUI script, so I copy the rules file generated by SMUI by hand. There are also some setup commands that you need to run only once. You can find them below:
# Initialize SMUI
curl -X PUT -H "Content-Type: application/json" -d '{"name":"gettingstarted", "description":"Gettingstarted"}' http://localhost:9000/api/v1/solr-index
# Copy the solr config
cp ./lib/solrconfig.xml ./solrdata/data/gettingstarted/conf/solrconfig.xml
# Copy the rules.txt as generated by SMUI
cp ./smui_path/rules.txt ./solrdata/data/gettingstarted/rules.txt
Setting up Solr
Now we are ready to configure the Solr core. First, we add two fields; then we add some data, commit it, and finally reload the core. If you use Paw, you can execute all commands with it; otherwise, run the curl commands below.
## Add field title
curl -X "POST" "http://localhost:8983/solr/gettingstarted/schema" \
-H 'Content-Type: application/json' \
-H 'commitWithin: 1000' \
-H 'overwrite: true' \
-d $'{
"add-field": {
"multiValued": false,
"name": "title",
"type": "text_general",
"stored": true
}
}'
## Add field category
curl -X "POST" "http://localhost:8983/solr/gettingstarted/schema" \
-H 'Content-Type: application/json' \
-H 'commitWithin: 1000' \
-H 'overwrite: true' \
-d $'{
"add-field": {
"multiValued": false,
"name": "category",
"type": "string",
"stored": true
}
}'
## ADD all docs
curl -X "POST" "http://localhost:8983/solr/gettingstarted/update" \
-H 'Content-Type: application/json' \
-H 'commitWithin: 1000' \
-H 'overwrite: true' \
-d $'[
{
"title": "My trip to San Francisco",
"category": "travel"
},
{
"title": "My hobby is horseback riding",
"category": "sports"
},
{
"title": "IT",
"category": "movie"
},
{
"title": "IT chapter two",
"category": "movie"
},
{
"title": "IT chapter one",
"category": "movie"
},
{
"title": "Clean it",
"category": "housing"
}
]'
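The requests above cover the fields and the documents, but not the commit and the core reload. A minimal sketch of those two calls follows, assuming the default Solr port and the gettingstarted core used throughout this post. The function names commit_core and reload_core are hypothetical helpers, not part of Solr.

```shell
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-gettingstarted}"

# Issue a hard commit so the freshly indexed documents become searchable.
commit_core() {
  curl -s "${SOLR_URL}/${CORE}/update?commit=true"
}

# Reload the core through the CoreAdmin API so configuration changes are picked up.
reload_core() {
  curl -s "${SOLR_URL}/admin/cores?action=RELOAD&core=${CORE}"
}
```

After indexing, run commit_core and then reload_core; both print the JSON response that Solr returns.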
One step I have not mentioned yet is changing the Solr configuration to enable the Querqy query re-writing step. This step is well documented in the Querqy documentation. Look at the solrconfig.xml file in the git repository; the relevant part is the addition of a query parser with the name querqy. This part configures the re-write chain as well as the logging. Copy the solrconfig.xml from the lib folder to the Solr core configuration, reload the core again, and you are ready to start playing around.
Adding the first rule
I am not going to re-write the documentation entirely here, but I do want to give you an idea of the available options. Each rule has two parts: input matching and output rules.
Input matching – Select one or multiple words. You can choose to match them only if they form the complete query, only if the query starts with them, or only if the query ends with them. Another impressive trick is that you can match the starting part of a word and use the remainder in the synonym. See some examples below:
"IT" -> Matches only queries that consist entirely of IT, with no other terms
"house -> Matches only if the query starts with house
ebook" -> Matches only if the query ends with ebook
sofa* -> Matches a query like sofabed, so you can search for sofa bed
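To make the boundary forms concrete, they can be mimicked with plain shell patterns. This is only a toy: Querqy itself matches on parsed query terms, and details such as case handling are ignored here. The function name input_matches is hypothetical, and the sketch assumes that the wildcard needs at least one extra character after the prefix.

```shell
# Toy matcher: returns 0 when the query satisfies the rule input, 1 otherwise.
# Illustration only; it compares case-sensitively and is not Querqy's implementation.
input_matches() {
  rule=$1
  query=$2
  case $rule in
    '"'*'"')                      # "IT": the input must be the complete query
      inner=${rule#\"}
      inner=${inner%\"}
      [ "$query" = "$inner" ]
      ;;
    '"'*)                         # "house: the query must start with the input
      inner=${rule#\"}
      case $query in
        "$inner" | "$inner "*) return 0 ;;
        *) return 1 ;;
      esac
      ;;
    *'"')                         # ebook": the query must end with the input
      inner=${rule%\"}
      case $query in
        "$inner" | *" $inner") return 0 ;;
        *) return 1 ;;
      esac
      ;;
    *'*')                         # sofa*: some query term starts with the prefix
      inner=${rule%\*}
      # Unquoted on purpose to split on whitespace; assumes plain words in the query.
      for term in $query; do
        case $term in
          "$inner"?*) return 0 ;;
        esac
      done
      return 1
      ;;
    *)                            # plain input: the term may appear anywhere
      case " $query " in
        *" $rule "*) return 0 ;;
        *) return 1 ;;
      esac
      ;;
  esac
}
```

For example, input_matches '"IT"' 'IT' succeeds, while input_matches '"IT"' 'IT chapter two' fails because the query contains other terms.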
Output rules – There are several different rules for you to choose from. The one this blog post started with is synonyms with multiple words. Another rule I like is boosting, for instance boosting a price range up and a specific category down.
cheap notebook =>
UP(10): * price:[350 TO 450]
DOWN(20): * category:accessories
Besides boosting, you can add filters in almost the same way.
notebook =>
FILTER: * -category:accessories
There is also an option to remove terms, and the final rule I’d like to mention is called Decorate rules. With these rules, you can return a response that would indicate a redirect, for example.
faq =>
DECORATE: redirect, /service/faq
There are some more advanced options; look at the Querqy documentation for more information about them.
What Querqy does
Although it is nice that Querqy does what it does, and it helps us with multi-word synonyms, you might be wondering how it works. A good trick is to use the debug option in Solr. That way, you can check the query after Querqy has rewritten it.
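A debug request could be sent as sketched below. It assumes the query parser is registered under the name querqy, as described for solrconfig.xml, and that searching only the title field (qf=title) is what you want. The function name debug_query is a hypothetical helper.

```shell
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-gettingstarted}"

# Send a query through the querqy parser and ask Solr for debug output.
# qf=title is an assumption: it limits the search to the title field added earlier.
debug_query() {
  curl -s -G "${SOLR_URL}/${CORE}/select" \
    --data-urlencode "q=$1" \
    --data-urlencode "defType=querqy" \
    --data-urlencode "qf=title" \
    --data-urlencode "debugQuery=true"
}
```

Running debug_query "trip SF" returns the normal response plus a debug section, in which the parsed query shows what Querqy made of the input.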
The first example is the query: trip SF. The matched rule is:
SF =>
SYNONYM: San Francisco
The debug output then shows the rewritten query:
(title:trip (title:sf | (+title:san +title:francisco)^0.5))~2
Notice the or ‘|’ between the original term and the synonym clause of the query, and the lower weight (^0.5) for the synonym in case both match. Another interesting example is the movie IT, whose title is also a very common word. With the next version of the movie out now, the titles have become IT chapter one and IT chapter two. We can create a rule to match the word IT. The problem is that the document IT chapter two has three words, whereas the document Clean it contains only two. Based on normal scoring, Clean it would score higher than the two new movies. Therefore we add a rule to boost the category movie when someone searches for IT. Below is the rule in SMUI, a graphical user interface to create Querqy rules, and below that the result of sending the query IT.
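In plain text, a rules file covering both examples could look like the sketch below. The SF synonym is the rule shown earlier. The IT rule is my assumption of how such a boost could be written in the same style as the earlier UP example, and the boost factor 100 is an arbitrary value. The RULES_FILE variable is only there so you can choose where the file is written.

```shell
RULES_FILE="${RULES_FILE:-./rules.txt}"

# Write two rules: the SF synonym from above and a boost for movies on the query IT.
# The boost factor 100 is an arbitrary value chosen for illustration.
cat > "$RULES_FILE" <<'EOF'
SF =>
  SYNONYM: San Francisco

IT =>
  UP(100): * category:movie
EOF
```

Copy the resulting file to ./solrdata/data/gettingstarted/rules.txt, as in the setup commands, and reload the core to activate the rules.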
I hope you now have an idea about the power of Querqy. Check the documentation for even more options.
References
https://github.com/renekrie/querqy
https://github.com/pbartusch/smui
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
https://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/
https://github.com/jettro/querqy-tryout