Using Compound Words in Elasticsearch

One of our customers asked us a question about using compound words in queries. Compound words are essential in languages like Dutch and German.

Some examples in Dutch are zomervakantie, kookboek and koffiekopje. When a user enters koffiekopje, we want to find documents containing koffie kop, as well as koffiekop and of course koffiekopje. In Elasticsearch, and other Lucene-based search engines, you can use the Compound Word Token Filter to accomplish this.

There are two different versions: the Hyphenation decompounder and the Dictionary decompounder. Reading the documentation, you'll learn that the Hyphenation decompounder is the recommended one. It first breaks a term up at its hyphenation points. By combining adjacent syllables, we create candidate terms and check them against a provided dictionary. Each match is added as an extra term, just like a synonym.

An example of hyphenation:
Koffiekopje -> kof - fie - kop - je
Combining these syllables could potentially result in the terms:
koffie, koffiekop, kop, kopje and koffiekopje.
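
To make this mechanism concrete, the following sketch shows how adjacent syllables can be combined into candidate terms and checked against a dictionary. This is an illustration of the idea only, not the actual Lucene implementation; the class name, method name and dictionary are made up for this example.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: combine adjacent syllables into candidate terms and
// keep the ones that appear in the dictionary. Not the Lucene implementation.
public class DecompoundSketch {

    private static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("koffie", "kop", "kopje"));

    public static List<String> matchingTerms(List<String> syllables) {
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < syllables.size(); start++) {
            StringBuilder candidate = new StringBuilder();
            for (int end = start; end < syllables.size(); end++) {
                candidate.append(syllables.get(end));
                if (DICTIONARY.contains(candidate.toString().toLowerCase())) {
                    matches.add(candidate.toString());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The syllables of "koffiekopje"; with this dictionary the result is
        // [koffie, kop, kopje]
        System.out.println(matchingTerms(Arrays.asList("kof", "fie", "kop", "je")));
    }
}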

As with other things concerning the Dutch language, I was a bit skeptical about the results of the filter. Therefore I decided to have a look at the implementation and try it out before using it for real. I had to create a class and do some subclassing to get access to protected fields and methods, but I was able to distill the mechanism used by the filter and the relevant Lucene classes.

The first step is the hyphenation itself. For this part, you use the Lucene class HyphenationTree. The following piece of code shows the construction of the hyphenation tree using a hyphenation rules XML file from the Objects For Formatting Objects (OFFO) project.

public TryHyphenation(String hyphenRulesPath) {
    HyphenationTree tree = new HyphenationTree();
    try (InputStream hyphenRules = new FileSystemResource(hyphenRulesPath).getInputStream()) {
        InputSource source = new InputSource(hyphenRules);
        tree.loadPatterns(source);
    } catch (IOException e) {
        LOGGER.error("Problem while loading the hyphen file", e);
    }
    this.tree = tree;
}
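
For completeness, here is a sketch of the surrounding class that the snippets in this post assume. The field names, the logger and the HYPHEN_CONFIG constant are my own additions; FileSystemResource comes from Spring, the other classes from Lucene.

import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.core.io.FileSystemResource;
import org.xml.sax.InputSource;

// Sketch of the class around the constructor above; names are assumptions.
public class TryHyphenation {

    private static final Logger LOGGER = LoggerFactory.getLogger(TryHyphenation.class);

    // Assumed to point at the Dutch hyphenation rules XML file
    private static final String HYPHEN_CONFIG = "<path to the hyphenation XML>";

    // The loaded hyphenation rules, used by the methods shown below
    private final HyphenationTree tree;

    // constructor and methods as shown in this post ...
}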

The constructor gives us access to a HyphenationTree containing the rules. Now we can ask for the hyphenation points of any string we choose ourselves. The result is an array of numbers; each number marks the character offset where a new syllable starts. The following code block turns these points into a list of strings containing the syllables. Printing them is just a matter of joining the strings with a separator.

public List<String> hyphenate(String sourceString) {
    // The two 1s are the minimum number of characters to keep before the first
    // and after the last hyphenation point
    Hyphenation hyphenator = this.tree.hyphenate(sourceString, 1, 1);
    int[] hyphenationPoints = hyphenator.getHyphenationPoints();
    List<String> parts = new ArrayList<>();
    // Cut the source string into syllables at the returned character offsets
    for (int i = 1; i < hyphenationPoints.length; i++) {
        parts.add(sourceString.substring(hyphenationPoints[i - 1], hyphenationPoints[i]));
    }
    return parts;
}

TryHyphenation hyphenation = new TryHyphenation(HYPHEN_CONFIG);
String sourceString = "Koffiekopje";
System.out.println("*** Find Hyphens:");
List<String> hyphens = hyphenation.hyphenate(sourceString);
String joinedHyphens = StringUtils.arrayToDelimitedString(
        hyphens.toArray(), " - ");
System.out.println(joinedHyphens);

Running the code gives the following output.

*** Find Hyphens:
Kof - fie - kop - je
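
For this input, the output above implies the following hyphenation points (a small illustration, derived from the substring loop in the hyphenate method).

// "Koffiekopje" has 11 characters; to produce the parts Kof, fie, kop and je,
// the hyphenation points returned for it must be these offsets:
int[] hyphenationPoints = {0, 3, 6, 9, 11};
// substring(0, 3) = "Kof", substring(3, 6) = "fie",
// substring(6, 9) = "kop", substring(9, 11) = "je"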

The next step is finding the terms we want to search for, based on the provided compound word. The Elasticsearch analyzer uses the Lucene class HyphenationCompoundWordTokenFilter to extract terms from compound words. We can use this class in our sample code as well, but we have to extend it to get access to the protected tokens variable. Therefore we create the following subclass.

private class AccessibleHyphenationCompoundWordTokenFilter extends HyphenationCompoundWordTokenFilter {
    public AccessibleHyphenationCompoundWordTokenFilter(TokenStream input, 
                                                        HyphenationTree hyphenator, 
                                                        CharArraySet dictionary) {
        super(input, hyphenator, dictionary);
    }

    // Expose the sub-tokens collected in the protected tokens list of the filter
    public List<String> getTokens() {
        return tokens.stream().map(compoundToken -> compoundToken.txt.toString())
                .collect(Collectors.toList());
    }
}

With the following code, we can find the tokens from our dictionary that are equal to a syllable or a combination of adjacent syllables. The class is not meant to be used this way, so the code looks a bit odd, but it does help us understand what happens. We need a tokenizer; we use the Standard tokenizer from Lucene. We also need a reader with access to the string that needs to be tokenized. Next, we create the CharArraySet containing our dictionary of terms to find. With the HyphenationTree, the tokenizer and the dictionary we create the AccessibleHyphenationCompoundWordTokenFilter. After calling the internal methods of the filter, we can call our method that has access to the internal tokens variable.

public static final List<String> DICTIONARY = Arrays.asList("koffie", "kop", "kopje");

public List<String> findTokens(String sourceString) {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader(sourceString));

    // true: the dictionary lookup ignores case
    CharArraySet charArraySet = new CharArraySet(DICTIONARY, true);
    AccessibleHyphenationCompoundWordTokenFilter filter =
            new AccessibleHyphenationCompoundWordTokenFilter(tokenizer, tree, charArraySet);
    try {
        filter.reset();
        // One call is enough: it reads the compound token and queues its sub-tokens
        filter.incrementToken();
        filter.close();
    } catch (IOException e) {
        LOGGER.error("Could not tokenize", e);
    }
    return filter.getTokens();
}

Now we have the terms from the compound word that are also in our dictionary.

System.out.println("\n*** Find Tokens:");
List<String> tokens = hyphenation.findTokens(sourceString);
String joinedTokens = StringUtils.arrayToDelimitedString(tokens.toArray(), ", ");
System.out.println(joinedTokens);

*** Find Tokens:
Koffie, kop, kopje
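
If you prefer not to subclass the filter, the same terms can also be read through the standard Lucene attribute API. The sketch below shows that approach; findTokensWithAttributes is a name I made up, it reuses the tree field from the TryHyphenation class, and note that consumed this way the stream also emits the original compound token before the dictionary matches.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch: consume the filter the regular Lucene way instead of subclassing it.
// The stream first returns the original compound token and then the dictionary
// matches, so "Koffiekopje" shows up in the result as well.
public List<String> findTokensWithAttributes(String sourceString) throws IOException {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader(sourceString));

    CharArraySet dictionary = new CharArraySet(Arrays.asList("koffie", "kop", "kopje"), true);
    HyphenationCompoundWordTokenFilter filter =
            new HyphenationCompoundWordTokenFilter(tokenizer, tree, dictionary);

    List<String> terms = new ArrayList<>();
    CharTermAttribute termAttribute = filter.addAttribute(CharTermAttribute.class);
    filter.reset();
    while (filter.incrementToken()) {
        terms.add(termAttribute.toString());
    }
    filter.end();
    filter.close();
    return terms;
}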

Using this test class is nice, but now we want to use it within Elasticsearch. The link below points to a gist containing the commands to try it out in the Kibana Console. Using this sample, you can play around and investigate the effect of the HyphenationCompoundWordTokenFilter. Don't forget to install the Dutch hyphenation rules file in the config folder of Elasticsearch: Compound Word Token Filter Installation

Gist containing the Java class and the Kibana Console example