Reliable Indexing Using Apache Solr


On one of SMISS's projects we faced the need for reliable data indexing using Apache Solr. What is Apache Solr and how should it be used? Is it possible to achieve truly reliable results with it?

24 July 2015

Situation

On one of our projects we faced the need for reliable data indexing using Apache Solr.

Apache Solr is a search engine built on the Apache Lucene library (both projects are in fact developed by the same team). It offers rich functionality, including full-text search, result highlighting, rich-document processing (Word, PDF, etc.), faceted search (i.e. search by categories), geospatial search, and more. Out of the box, Apache Solr supports integration with databases and replication (as well as centralized configuration, load balancing, automatic sharding, distributed indexing, and so on). The engine is implemented in Java as a web application that runs in a servlet container.

By "reliable" we mean "definitely done", not "sent for indexing and forgotten". The usual approach to indexing is to use a client (e.g., SolrJ in the case of Java) that sends HTTP requests with the data in the appropriate format to the Solr server. The problem with this approach is that if the server running Apache Solr is down, we can either abandon the indexing (not an option) or retry the request and wait (blocking the processing of subsequent data).
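For illustration, here is a minimal sketch of that conventional client-side approach (assuming the Solr 5.x-era SolrJ API; the URL, core name and field names are placeholders, not taken from our project):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectIndexingExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        doc.addField("title", "Some document");
        // if the Solr server is down, this call simply fails with an exception,
        // and the caller has to decide whether to drop the data or block and retry
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}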

Fortunately, Solr provides another indexing mechanism: the Data Import Handler. It lets you configure indexing of a data source with queries issued from the Solr server itself. For this, the data source (often a relational database) must be able to return data, preferably page by page and filtered by, for example, date (so that indexing can be scheduled, say, every hour and only the new data requested). Alas, in our case this didn't work out: the system uses a non-standard component for storing and processing data, which is not a database in the classic sense of the term and cannot do what we have just described.

Since the Solr-specific approach is not applicable here, we had to turn to traditional means. In such situations, queues are often used: we send the information required for indexing to a queue, a consumer reads it and tries to make a request to the server; if the request fails, the message stays in the queue, waiting for better times (i.e. until the Solr server is up again). In principle, this could have been done in the usual way: the data-processing service sends the information for indexing (it may be an object) to a queue, and another service performs the indexing requests. However, there is a nuance: we have custom conversion of server requests and responses, as well as the ability to make queries quite different from those provided by the standard functionality. These features had already been implemented as a plugin for Solr. And since there was already one place where all the "custom" request-related features were concentrated, it was decided to put the indexing next to them. In other words, the queue should be consumed from inside the plugin.
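To give an idea of what "waiting for better times" means in code, here is a minimal sketch of a transacted javax.jms consumption loop (not our actual implementation; connectionFactory, the queue name and indexMessage() are hypothetical placeholders, and the surrounding method is assumed to declare throws JMSException):

Connection connection = connectionFactory.createConnection();
try {
    connection.start();
    // transacted session: a message leaves the queue only after session.commit()
    Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
    MessageConsumer consumer = session.createConsumer(session.createQueue("index.queue"));
    while (!Thread.currentThread().isInterrupted()) {
        Message message = consumer.receive(1000);
        if (message == null) {
            continue;
        }
        try {
            indexMessage(message); // send the data to the Solr server
            session.commit();      // success: the message is removed from the queue
        } catch (Exception e) {
            session.rollback();    // failure: the message stays in the queue for a retry
        }
    }
} finally {
    connection.close();
}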

Below you will find a description of the general approach, some implementation details, and the solutions to the problems we faced during development.

Implementation

Before describing the details, we should note that the Apache Solr wiki has an article about the "hooks" to which your own code can be attached. It also gives a general description of the approach and of plugin development. We suggest that everyone who plans to write their own plugin read that article first.

Hook choice and initialization

So the first question to solve is where to place the code that consumes messages. There are several options (see the link above); in our case an UpdateRequestProcessorFactory was already used for additional processing of indexing requests (generation of additional fields), so we decided to put the queue initialization there as well (indirectly). The main advantage of this hook is that our UpdateRequestProcessorFactory can implement the SolrCoreAware interface, which gives access to the current core and also to resources from the core folder (via ResourceLoader).
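For orientation, a hedged sketch of what such a factory might look like (the class name and the CustomFieldsUpdateProcessor are illustrative, not our actual code; the factory is assumed to be registered in an updateRequestProcessorChain in solrconfig.xml):

import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
import org.apache.solr.util.plugin.SolrCoreAware;

public class CustomIndexingProcessorFactory extends UpdateRequestProcessorFactory
        implements SolrCoreAware {

    private CustomIndexUpdater indexUpdater;

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
            UpdateRequestProcessor next) {
        // the regular per-request processor that generates the additional fields
        return new CustomFieldsUpdateProcessor(next);
    }

    @Override
    public void inform(SolrCore core) {
        // queue initialization goes here - see the next snippet
    }
}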

So we use the SolrCoreAware.inform(SolrCore core) method for initialization; this is where we create our "indexer":

@Override
public void inform(SolrCore core) {
    // the custom queue consumer is initialized in the constructor
    indexUpdater = new CustomIndexUpdater(core);

    // do not forget to release resources correctly when the core is shut down
    core.addCloseHook(new CloseHook() {
        @Override
        public void preClose(SolrCore core) {
            indexUpdater.shutdown();
        }

        @Override
        public void postClose(SolrCore core) {
        }
    });
}

When is SolrCoreAware.inform(SolrCore core) called?

This method is called when everything is "almost ready". In other words, while the core and plugins are being loaded, constructors run first (obviously), then the init(Map / NamedList / PluginInfo) methods present in certain hook classes, then inform(ResourceLoader) for classes that implement the ResourceLoaderAware interface, and only after that SolrCoreAware.inform(SolrCore core).

We will omit the details of message consumption here - each case has its own logic, and the topic of this article is different. A more interesting question is what to do with the received data.

Indexing

To be precise, the question is the following: at what level should indexing requests be sent so as "not to break anything"? The fact is that once you have a SolrCore, there are two obvious ways to do this.

The first one is to call core.getUpdateHandler() and use the returned UpdateHandler. But there is a nuance called SolrCloud. When we work "in the cloud" (mostly in the out-of-the-box replication mode), an additional processor is embedded into the normal chain of update processors (which handle indexing requests); it forwards the update to the other nodes, thereby keeping all replicas of the cluster in sync. The problem is that UpdateHandler works at a lower level - this is the class directly responsible for a particular index on a given node. So by using it we would update the data only on the current server, leaving the others out of the loop. We use SolrCloud, which is why this did not suit us.

Of course, if you have only one Apache Solr server and you have not embedded your own processors into the update processing chain, then UpdateHandler may be your choice - but first make sure you understand how it differs from the approach that follows.
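For completeness, a sketch of that lower-level variant (assuming a single node and no custom processors; doc stands for a SolrInputDocument prepared from the queue message, and IOException handling is omitted):

UpdateHandler updateHandler = core.getUpdateHandler();
AddUpdateCommand cmd = new AddUpdateCommand(new LocalSolrQueryRequest(core, new ModifiableSolrParams()));
cmd.solrDoc = doc;
updateHandler.addDoc(cmd); // updates only the local index, bypassing the update processor chain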

We have already mentioned the main idea of the second approach - using the update processing chain. The chain itself, however, does not handle requests: first a processor has to be created (note that this must be done for each indexing request). This gives us the following code:

// the chain name is passed as the parameter; null selects the default chain
UpdateRequestProcessorChain processorChain = core.getUpdateProcessingChain(null);
UpdateRequestProcessor updateProcessor = processorChain.createProcessor(
        new LocalSolrQueryRequest(core, new ModifiableSolrParams()), new SolrQueryResponse());

To index the data you have obtained, look at the processor's process* methods - their names speak for themselves: processAdd(AddUpdateCommand), processDelete(DeleteUpdateCommand), and so on.
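As a hedged sketch (field names and ids are illustrative; IOException handling is omitted), feeding a document and a deletion through the processor created above might look like this:

AddUpdateCommand addCmd = new AddUpdateCommand(new LocalSolrQueryRequest(core, new ModifiableSolrParams()));
addCmd.solrDoc = new SolrInputDocument();
addCmd.solrDoc.addField("id", "42");
addCmd.solrDoc.addField("title", "Indexed from the queue");
updateProcessor.processAdd(addCmd);

DeleteUpdateCommand deleteCmd = new DeleteUpdateCommand(new LocalSolrQueryRequest(core, new ModifiableSolrParams()));
deleteCmd.setId("41");
updateProcessor.processDelete(deleteCmd);

// signal that this batch of updates is done
updateProcessor.finish();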

Transactions and commits

Of course, reliable queue consumption implies transactionality of the process (or at least makes you think about it). Using a JMS transaction while indexing tempts you to draw a parallel between transaction commits and Apache Solr commits and to perform them, for example, simultaneously. However, this is not the best idea: frequent commits, especially on a large index, lead to a drop in performance even when a soft commit is used. The Apache Solr developers say the same and recommend configuring automatic commits at certain intervals (for example, a soft commit every minute and a hard commit every half hour).
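If an explicit commit from the plugin is ever needed (for example, in integration tests), it can be sent through the same processor as the other commands; in normal operation it is better to rely on the autoCommit / autoSoftCommit intervals configured in solrconfig.xml. A sketch, reusing the updateProcessor from above:

CommitUpdateCommand commitCmd = new CommitUpdateCommand(
        new LocalSolrQueryRequest(core, new ModifiableSolrParams()), false); // second argument: optimize the index
commitCmd.softCommit = true; // a soft commit is cheaper, but still not free
updateProcessor.processCommit(commitCmd);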

Configuration (+ nuance for Spring)

 

In fact, everything described above is enough to implement our mechanism. A nice addition, however, would be the ability to configure the whole process - it is easy to imagine configurable JMS settings or indexing business rules. Here we again use the SolrCoreAware interface we implemented: the core object gives us access to the resource loader, which we can use for our own purposes:

SolrResourceLoader resourceLoader = core.getResourceLoader();
try (InputStream resource = resourceLoader.openResource("custom-solr-config.xml")) {
    // load the file and apply the necessary settings
} catch (IOException e) {
    throw new SettingsLoadingException("Settings unavailable", e);
}

The loader looks for the file in the core configuration folder (conf/). Naturally, if you use SolrCloud and store all files in ZooKeeper, the file will be loaded from there.

 

What about Spring?

This framework is quite popular, and many components of a system, however different their functions, often have one thing in common - the use of Spring beans for dependency injection. It is natural to consider using this familiar technology for configuring things in Apache Solr plugins as well.

So here is one more nuance, concerning class loading with Spring. If you try to use it in the usual way (creating a context from an XML file and loading a bean), you will run into a problem: the bean class will not be found (unless, of course, it has already been used and therefore already loaded into the JVM). We did not study this problem in detail, but a solution was found - using the class loader available from the ResourceLoader. The following code provides a universal wrapper for working with anything (not necessarily Spring) that loads classes "manually".

public class SolrUtils {

    /**
     * Abstraction for operations to be executed with a specific class loader
     * @param <T> return type of {@link #work()}
     * @see SolrUtils#withClassLoader(ClassLoader, Action)
     */
    public interface Action<T> {
        /**
         * Operations to be executed are to be put here
         * @throws CustomException any exception that the operations inside can throw
         * should be caught and translated into {@link CustomException}
         */
        T work() throws CustomException;
    }

    /**
     * Executes something using the provided class loader. Useful when custom class loading
     * is involved, as with Spring. We need to use the Solr core's internal class loader
     * because otherwise the library classes are unavailable.
     * <p>
     * It sets the current thread's class loader to the given one, executes the given operations
     * and then reverts the class loader to the original one.
     * @param <T> type of the action result
     *
     * @param classLoader class loader to use
     * @param action operations to be performed
     * @throws CustomException the action can throw this exception
     */
    public static <T> T withClassLoader(ClassLoader classLoader, Action<T> action) throws CustomException {
        ClassLoader contextClassLoader = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(classLoader);

        T result;
        try {
            result = action.work();
        }
        finally {
            Thread.currentThread().setContextClassLoader(contextClassLoader);
        }

        return result;
    }
}

Here is how loading a bean looks when using the code above:

CustomBean bean = SolrUtils.withClassLoader(resourceLoader.getClassLoader(), new SolrUtils.Action<CustomBean>() {
    public CustomBean work() throws SettingsLoadingException {
        try (InputStream resource = resourceLoader.openResource(CUSTOM_SOLR_CONFIG)) {
            try (ResourceXmlApplicationContext applicationContext =
                    new ResourceXmlApplicationContext(new ByteArrayResource(IOUtils.toByteArray(resource)))) {
                return (CustomBean) applicationContext.getBean(beanName);
            }
        } catch (IOException e) {
            throw new SettingsLoadingException("Local spring context unavailable", e);
        }
    }
});

Conclusion

Although this article describes the solution to a specific problem, I think the approaches and advice described here can also be used in other situations involving the development of your own plugins for Apache Solr, and not only there. For example, you may need to extend UpdateRequestProcessorFactory for a custom indexing query. The technique of using an "external" class loader described above can also be useful when writing OSGi applications.

To stay within the topic of the article, some things were described only "as is" - the SolrCloud infrastructure, our own search and indexing handlers; these can be discussed in detail in future articles if readers are interested.