Solr managed resources filters


Solr has a nice feature called managed resources. Solr ships with a couple of them, but it is quite easy to add your own. In this blog post, I’d like to take a concrete example and go through the steps of writing the code for a custom managed resource, making it available to Solr, and using it during analysis.

It is commonly known that stemming is not perfect. Even when using a dictionary-based stemming filter like HunspellStemFilterFactory, you will still get words that are stemmed incorrectly. You can dig into the Hunspell affix file and add rules to fix the issues, but it is sometimes easier to just skip stemming for those words.

Solr provides a filter factory that does just that: KeywordMarkerFilterFactory uses a plain-text file containing the words to be ignored by stemming filters. Maintaining such a list is not very convenient, though: the file is uploaded as part of the config set, and every change requires updating the config set. It would be better to leverage the power of managed resources to edit the list. You could even build a UI so other users can maintain it; after all, this can be considered content as well.
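Before going managed, it helps to see what “protecting” a word actually does. The little self-contained Lucene sketch below is mine, not part of the original setup: it uses a plain WhitespaceTokenizer and PorterStemFilter (instead of Hunspell, to avoid needing dictionary files, and assuming the Lucene 8.x class locations). A word placed in the CharArraySet passed to SetKeywordMarkerFilter comes out of the stemmer untouched.

import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeywordMarkerDemo {

    public static void main(String[] args) throws Exception {
        // "maltese" is protected, "walking" is not
        final CharArraySet protectedWords = new CharArraySet(Arrays.asList("maltese"), true);

        final WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("walking maltese"));

        // Mark the protected words first, then stem: the stemmer skips keyword-marked tokens
        final TokenStream stream = new PorterStemFilter(new SetKeywordMarkerFilter(tokenizer, protectedWords));
        final CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // prints "walk" then "maltese"
        }
        stream.end();
        stream.close();
    }
}

The same SetKeywordMarkerFilter is what KeywordMarkerFilterFactory (and the managed factory we are about to write) puts in front of the stemmer.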

BaseManagedTokenFilterFactory is an abstract class that does all the heavy lifting, so to write a managed filter factory we just need to extend it. Normally you would wrap the ‘standard’ factory when writing a managed resource; in our case, KeywordMarkerFilterFactory is pretty simple.

The interesting part is:

  public TokenStream create(TokenStream input) {
    if (pattern != null) {
      input = new PatternKeywordMarkerFilter(input, pattern);
    }
    if (protectedWords != null) {
      input = new SetKeywordMarkerFilter(input, protectedWords);
    }
    return input;
  }

This factory can produce two types of filters: SetKeywordMarkerFilter or PatternKeywordMarkerFilter. We are interested in SetKeywordMarkerFilter.

This is what the managed filter factory looks like:

package my.package;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.rest.ManagedResource;
import org.apache.solr.rest.schema.analysis.BaseManagedTokenFilterFactory;
import org.apache.solr.rest.schema.analysis.ManagedWordSetResource;

import java.util.Map;
import java.util.Set;

/**
 * TokenFilterFactory that uses the ManagedWordSetResource implementation
 * for managing protected (keyword-marked) words via the REST API.
 */
public class ManagedKeywordMarkerFilterFactory extends BaseManagedTokenFilterFactory {

    private CharArraySet protectedWords = null;

    public ManagedKeywordMarkerFilterFactory(final Map<String, String> args) {
        super(args);
    }

    @Override
    public String getResourceId() {
        return "/schema/analysis/protwords/" + handle;
    }

    @Override
    protected Class<? extends ManagedResource> getManagedResourceImplClass() {
        return ManagedWordSetResource.class;
    }

    @Override
    public void onManagedResourceInitialized(final NamedList<?> args, final ManagedResource res) throws SolrException {
        final Set<String> managedWords = ((ManagedWordSetResource) res).getWordSet();
        // getBooleanArg returns null when the init arg is absent, so default to case-sensitive
        final Boolean ignoreCase = args.getBooleanArg("ignoreCase");
        protectedWords = new CharArraySet(managedWords.size(), ignoreCase != null && ignoreCase);
        protectedWords.addAll(managedWords);
    }


    @Override
    public TokenStream create(final TokenStream input) {
        if (protectedWords == null) {
            throw new IllegalStateException("Managed protwords not initialized correctly!");
        }
        return new SetKeywordMarkerFilter(input, protectedWords);
    }
}

getResourceId(…) returns the path under which the managed resource is exposed (the handle is an extra path segment that is provided when the factory is declared in the schema; see the managed attribute below).

onManagedResourceInitialized(…) initializes the list (it goes without saying that we are not expecting a huge list!)

create(…) instantiates a SetKeywordMarkerFilter with the list of words that should be “protected” against stemmers.

The next step is to package this class into a JAR file, using whichever build tool you are used to. This JAR file can then be made available to Solr by placing it in the dist sub-folder of the Solr installation.

The JAR file can then be loaded by adding a lib definition in the solrconfig.xml file of the collection that will make use of it (the dir and regex in the snippet below are placeholders; adjust them to match the location and name of your JAR):


    <luceneMatchVersion>8.2.0</luceneMatchVersion>
    .
    .
    .
    <lib dir="${solr.install.dir:../../../..}/dist/" regex="managed-keyword-marker-\d.*\.jar" />
    .
    .
    .

After restarting Solr, you can use the factory in the analysis chain of a field type in your schema:


    .
    .
    .
    <filter class="my.package.ManagedKeywordMarkerFilterFactory" managed="en" />
    .
    .
    .

The managed attribute (“en” here) is the handle mentioned in the factory implementation.

The new managed resource is accessible through the REST API, just like any other managed resource.

For example: http://localhost:8981/solr/<my-collection>/schema/analysis/protwords/en
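The resource can be scripted against like any other REST endpoint. Here is a minimal client sketch; it is only an illustration, not part of the original setup: it assumes Java 11+, a collection literally named my-collection on the port used above, and that the managed word set accepts the same JSON payloads as the built-in managed stopwords (a PUT with a JSON array adds words, a GET returns the current list).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProtwordsClient {

    public static void main(String[] args) throws Exception {
        final HttpClient client = HttpClient.newHttpClient();
        final String endpoint = "http://localhost:8981/solr/my-collection/schema/analysis/protwords/en";

        // Add two words to the managed list
        final HttpRequest put = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("[\"maltese\", \"solr\"]"))
                .build();
        System.out.println(client.send(put, HttpResponse.BodyHandlers.ofString()).body());

        // Read the current list back
        final HttpRequest get = HttpRequest.newBuilder(URI.create(endpoint)).GET().build();
        System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).body());
    }
}

Keep in mind that, as with any managed resource, changes are only picked up by the analysis chain after the core (or collection) has been reloaded.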

You can apply this to all kinds of use cases. The bottom line: if you need to maintain a dynamic list, check out managed resources. Creating your own is as easy as following the steps above. For a more complex example, check out ManagedSynonymFilterFactory.