Distributed Lucene Overview
NCache provides a Lucene module, which allows you to use Lucene for text searching with NCache. Each server of NCache has a dedicated Lucene module. This makes Lucene distributed, scalable, and highly available with NCache.
Note
This feature is available in NCache Enterprise and Professional for the Partitioned Topologies only.
Note
NCache uses the 4.8 version of Lucene.Net.
Why Use Lucene with NCache?
Lucene is a powerful and efficient search engine that provides a wide range of text-searching techniques to meet your business requirements. It offers powerful search algorithms and supports many query types, giving you more choices than most other text search engines.
As powerful as Lucene is on its own, it has limitations. Lucene runs in-process in the client application, which means that it isn't scalable and is a single point of failure.
NCache provides a distributed implementation of Lucene with minor changes to its API. Lucene's API calls NCache at the backend. Because NCache is distributed in nature, Distributed Lucene provides linear write scalability: the documents indexed by the applications are automatically distributed among the cache nodes, where they are indexed separately.
Note
Distributed Lucene uses a separate and dedicated Lucene store instead of the NCache cache-store.
Similarly, Distributed Lucene provides linear read scalability, since queries are propagated to each partition and the results are merged. More partitions provide more read and write scalability. Lucene indexes are persisted on your physical drive. The more nodes you have, the higher the scalability, performance, and storage capacity available to accommodate a large number of Lucene documents and indexed data.
Working of Distributed Lucene
Important
It's highly encouraged that you use an SSD to index and search Lucene documents instead of an HDD.
The behavior and working of Lucene and Distributed Lucene are almost the same with a few changes. The workflow, data distribution, and components of Distributed Lucene are explained in the following sections:
Distributed Lucene Workflow
The diagram below shows how the Distributed Lucene model works.
The client application can use the Lucene API to index documents (analyzed by the supported Lucene analyzers) or to query existing indexed documents. These API operations act like Remote Procedure Calls (RPCs) and are forwarded directly to the NCache cluster. The cluster determines the nature of each call and forwards it to the Distributed Lucene modules attached to each server node. These modules execute the call, and depending on its nature, one of the following actions takes place:
- Query Document Call: For a query call, the Distributed Lucene modules return results to the client side, where all of these results are merged and processed.
- Index Document Call: For a call to index a document, the Distributed Lucene modules persist that document on a disk drive.
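The query path described above is a scatter-gather pattern. The following is a minimal conceptual sketch in Python, not NCache's or Lucene.Net's API: the toy scoring (term frequency), the `search_partition` and `scatter_gather` names, and the in-memory dictionaries standing in for per-node indexes are all assumptions made for illustration.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id); hypothetical result shape


def search_partition(index: Dict[str, str], term: str) -> List[Hit]:
    """Toy per-node search: the score is how often the term occurs in the text."""
    hits = [(float(text.lower().split().count(term)), doc_id)
            for doc_id, text in index.items()]
    hits = [h for h in hits if h[0] > 0]
    # Best score first; ties broken by document id so the order is stable.
    return sorted(hits, key=lambda h: (-h[0], h[1]))


def scatter_gather(partitions: List[Dict[str, str]], term: str, top_n: int) -> List[Hit]:
    """Run the query on every partition, then merge the sorted partial results."""
    partials = [search_partition(p, term) for p in partitions]
    merged = heapq.merge(*partials, key=lambda h: (-h[0], h[1]))
    return list(islice(merged, top_n))
```

Because each partition returns its hits already sorted, the client only needs a streaming merge to produce the global top results.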
Data Distribution
A distribution map is generated against a cache cluster. For Distributed Lucene, this map is generated:
- On cache creation.
- On the addition or removal of a server node.
Warning
In case the distribution map doesn't exist, the cache fails to start, and exceptions are logged in the event viewer and service logs.
This map records which buckets are assigned to which cache nodes. The total number of buckets for a Distributed Lucene cache is 100. These 100 buckets are distributed in the cluster using a specific strategy.
Having 100 buckets means that an index is split into 100 sub-indexes across the cache cluster. Based on the number of buckets assigned to each node, all of the respective index files are also moved to that particular node. Hence, whenever state transfer causes the movement of buckets across the cache cluster, the corresponding index files are also moved as a part of the process.
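To make the bucket idea concrete, here is a minimal sketch in Python. Only the bucket count of 100 comes from the text above. The round-robin assignment, the CRC32 hash, and the function names are assumptions chosen for illustration; the actual distribution strategy used by NCache is not described here.

```python
import zlib
from typing import Dict, List

BUCKET_COUNT = 100  # total buckets per Distributed Lucene cache


def build_distribution_map(nodes: List[str]) -> Dict[int, str]:
    """Assign buckets to nodes round-robin (an assumed strategy, not NCache's)."""
    return {bucket: nodes[bucket % len(nodes)] for bucket in range(BUCKET_COUNT)}


def bucket_for_key(key: str) -> int:
    """Map a document key to a bucket with a stable hash (assumed hash function)."""
    return zlib.crc32(key.encode("utf-8")) % BUCKET_COUNT


def node_for_key(key: str, dist_map: Dict[int, str]) -> str:
    """Find the node that owns the bucket, and hence the sub-index, for a key."""
    return dist_map[bucket_for_key(key)]
```

With three nodes, this sketch gives each node 33 or 34 buckets, and every document key resolves to exactly one owning node.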
A server node can contain multiple indexes, and each index within that server node contains buckets that are assigned to it according to the distribution strategy of the cluster. The documents against the indexes are equally distributed via these buckets. These indexes are also persisted on your physical drive.
NCache has runtime or dynamic distribution: a node start or stop triggers a change in the distribution map of the cache cluster and triggers state transfer. However, in NCache Distributed Lucene, the indexes are persisted on your physical drive to avoid unnecessary redistribution of indexes on node start or stop. This means that in NCache Distributed Lucene, only the addition or removal of a node from the cache cluster changes the distribution map and triggers state transfer for the running server nodes. For the stopped server nodes, state transfer takes place once they come back online.
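The effect of adding a node can be sketched as follows. This Python example is a hedged illustration only: the `add_node` function and its "take from the fullest node" rule are hypothetical, not NCache's rebalancing algorithm. It shows the general idea that a new map is derived from the existing one and that only the reassigned buckets (and their index files) need to be transferred.

```python
from typing import Dict, List, Tuple


def add_node(dist_map: Dict[int, str], new_node: str) -> Tuple[Dict[int, str], List[int]]:
    """Return a new map and the buckets that must be transferred to new_node."""
    node_count = len(set(dist_map.values())) + 1
    target = len(dist_map) // node_count  # buckets the new node should receive
    new_map = dict(dist_map)              # the existing map is left untouched
    moved: List[int] = []
    while len(moved) < target:
        # Group buckets by their current owner.
        owned: Dict[str, List[int]] = {}
        for bucket, node in new_map.items():
            owned.setdefault(node, []).append(bucket)
        # Take one bucket from the existing node that currently holds the most.
        donor = max((n for n in owned if n != new_node),
                    key=lambda n: (len(owned[n]), n))
        bucket = max(owned[donor])
        new_map[bucket] = new_node
        moved.append(bucket)
    return new_map, moved
```

Buckets that are not in the returned `moved` list keep their owner, so their index files stay where they are.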
The following are some important points to consider for NCache Distributed Lucene:
- Distribution Map of Cache: The distribution map of a new cache is only generated when the cache creation is successful. If distribution map generation fails for some reason, the cache creation rolls back (reverts).
- State Transfer: Any change in the cache cluster membership changes the distribution map, which in turn triggers state transfer in the cache cluster.
- Node Addition to an Existing Cluster: In this case, all server nodes (even the one that is being added) should be physically available and their service should be running. When a node is added to the Distributed Lucene cache cluster, the existing distribution map is fetched from the existing server nodes. A new distribution map is then generated and shared with all the server nodes. A commit call is sent to persist this distribution map.
- Node Removal from an Existing Cluster: In this case, similar to node addition to an existing cluster, all server nodes (except for the one that is being removed) should be physically available, and their service should be running. Otherwise, this operation will fail.
- Node Shutdown During State Transfer: If a server node shuts down during state transfer, the state transfer for that node is halted. When the node starts again, the transfer resumes from the bucket at which the node shut down.
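The resume behavior in the last point can be modeled with a small sketch. This is a conceptual Python illustration under stated assumptions: the `StateTransfer` class and the `is_node_up` callback are hypothetical names, and a real transfer moves index files rather than integers.

```python
from typing import Callable, List


class StateTransfer:
    """Tracks which buckets still need to be moved to a node."""

    def __init__(self, buckets: List[int]) -> None:
        self.pending: List[int] = list(buckets)
        self.completed: List[int] = []

    def run(self, is_node_up: Callable[[], bool]) -> bool:
        """Transfer buckets in order; halt at the current bucket if the node is down."""
        while self.pending:
            if not is_node_up():
                return False  # halted: pending[0] is the bucket to resume from
            self.completed.append(self.pending.pop(0))
        return True
```

Calling `run` again after the node restarts continues from the first pending bucket instead of starting over.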
Important
In the case of Developer Installation, stores can configure Distributed Lucene.
Initialize Distributed Lucene
Before you use Distributed Lucene to index your documents and search them afterward, you need to initialize it. To initialize it, you need to provide the cache name and index name.
Index Data
The indexing process in Distributed Lucene is the same as in Lucene itself. In the case of Distributed Lucene, NCache maintains a key-value store for the distribution of documents, and an autogenerated key is added to each document. The document is indexed on the node against that specific key.
Index Searching
Once Distributed Lucene has been initialized and your documents have been indexed, you can perform text-based searches on these documents.
Facets
A category is an essential aspect of an indexed document that is used to classify it. For example, while searching for clothes in an e-commerce store, the categories of the clothes can be price, material, brand, etc.
In faceted search, in addition to the normal search results, you also get facet results, which consist of subcategories for certain categories. Continuing the example above, the subcategories for material facets can be cotton, wool, leather, etc.
Faceted search makes it easy to find the specific documents that you require. NCache supports Facets with Distributed Lucene, which helps you find the desired documents efficiently. Facets work mostly the same way in Lucene and Distributed Lucene, but with several performance enhancements that are highlighted in Distributed Lucene Facets.
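As a rough illustration of what a faceted search returns, the following Python sketch produces both the matching documents and a count per facet value. It is not the Lucene.Net Facet API: the `faceted_search` function, the `title` field, and the substring matching are assumptions made to keep the example small.

```python
from collections import Counter
from typing import Dict, List, Tuple


def faceted_search(docs: List[Dict[str, str]], term: str,
                   facet_field: str) -> Tuple[List[Dict[str, str]], Dict[str, int]]:
    """Return the matching documents plus a count per facet value among them."""
    matches = [d for d in docs if term in d["title"].lower()]
    facets = Counter(d[facet_field] for d in matches if facet_field in d)
    return matches, dict(facets)
```

For the clothing example, searching for "jacket" with the `material` facet would return the jackets together with how many of them are wool, leather, and so on.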
Geo-Spatial API
Data that contains coordinate values of longitude and latitude is referred to as Geo-Spatial Data. This data is useful if you want to search data based on its location. For example, you want to index a document that contains the information of a restaurant. This document contains various fields, and one of those fields contains the longitude and latitude values. To search this document in the future based on its location (let's say the restaurant nearest to you), you'll also want to index the longitude and latitude fields.
Lucene has a very powerful Geo-Spatial Data indexing and searching feature, and NCache also allows you to index and then search documents with respect to their location through the Lucene API. To understand how to index and search this data, refer to the Distributed Lucene Geo-Spatial API section.
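The "nearest restaurant" example can be sketched with a plain distance calculation. This Python example is a conceptual illustration only: it uses the haversine formula and a linear scan, whereas Lucene's spatial support relies on dedicated spatial indexing. The `lat` and `lon` field names and both function names are hypothetical.

```python
import math
from typing import Dict, List

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(docs: List[Dict], lat: float, lon: float) -> Dict:
    """Return the document whose stored coordinates are closest to the query point."""
    return min(docs, key=lambda d: haversine_km(lat, lon, d["lat"], d["lon"]))
```

The key point is that the latitude and longitude must be stored with each document so that a location-based query has something to compare against.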
Distributed Lucene Behavior in a Partial Cluster
A cache cluster is declared partial when one or more of its nodes become unavailable, which limits connectivity. The behavior of Distributed Lucene in a partial cluster is explained below:
Read Operations on a Partial Cluster
NCache Distributed Lucene allows you to retrieve (read) data from a partial cluster by setting the value of the AllowPartialResults property to TRUE on the IndexReader class instance. By default, the value of the AllowPartialResults flag is set to FALSE.
These read operations will return partial or incomplete data. However, in the case of the Partition-Replica topology, the cache cluster tolerates a single node failure through its replica, so you can still retrieve complete data. With multiple node failures, you can retrieve only partial data.
Warning
If you try to read data from a partial cluster and you have set the value of the AllowPartialResults flag to FALSE, an exception will be thrown.
Write Operations on a Partial Cluster
Distributed Lucene doesn't allow you to perform write operations such as Add, Update, and Delete on a partial cluster. If you try to perform these operations on a partial cluster, an exception will be thrown.
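The read and write rules for a partial cluster can be summarized in a small model. This Python sketch is an assumption-laden illustration, not NCache code: the `Cluster` class, the `PartialClusterError` exception, and the `allow_partial_results` parameter (standing in for the AllowPartialResults flag) are hypothetical, and replica-based recovery in the Partition-Replica topology is deliberately left out.

```python
from typing import Dict, Iterable, List


class PartialClusterError(Exception):
    """Raised when an operation is not allowed on a partial cluster."""


class Cluster:
    def __init__(self, partitions: Dict[str, List[str]], down: Iterable[str] = ()) -> None:
        self.partitions = partitions  # node name -> document ids held by that node
        self.down = set(down)         # nodes that are currently unavailable

    @property
    def is_partial(self) -> bool:
        return bool(self.down)

    def read(self, allow_partial_results: bool = False) -> List[str]:
        """Reads fail on a partial cluster unless partial results are allowed."""
        if self.is_partial and not allow_partial_results:
            raise PartialClusterError("cluster is partial; partial results not allowed")
        return sorted(doc for node, docs in self.partitions.items()
                      if node not in self.down for doc in docs)

    def write(self, node: str, doc: str) -> None:
        """Writes are always rejected on a partial cluster."""
        if self.is_partial:
            raise PartialClusterError("writes are not allowed on a partial cluster")
        self.partitions[node].append(doc)
```

In this model, a read with `allow_partial_results=True` returns only the documents held by the nodes that are still reachable.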
Unsupported Lucene APIs
Given below is a list of the Lucene APIs that are not supported in Distributed Lucene.
DirectoryReader
public static DirectoryReader Open(IndexCommit commit)
public static DirectoryReader Open(IndexCommit commit, int termInfosIndexDivisor)
public static DirectoryReader OpenIfChanged(DirectoryReader oldReader)
public static DirectoryReader OpenIfChanged(DirectoryReader oldReader, IndexCommit commit)
public static DirectoryReader OpenIfChanged(DirectoryReader oldReader, IndexWriter writer, bool applyAllDeletes)
IndexSearcher
public IndexSearcher(IndexReaderContext context, TaskScheduler executor)
public Document Document(int docID, ISet<string> fieldsToLoad)
public virtual Weight CreateNormalizedWeight(Query query)
public virtual TopDocs SearchAfter(ScoreDoc after, Query query, int n)
public virtual TopDocs SearchAfter(ScoreDoc after, Query query, Filter filter, int n)
public virtual TopDocs SearchAfter(ScoreDoc after, Query query, Filter filter, int n, Sort sort)
public virtual TopDocs SearchAfter(ScoreDoc after, Query query, int n, Sort sort)
public virtual TopDocs SearchAfter(ScoreDoc after, Query query, Filter filter, int n, Sort sort, bool doDocScores, bool doMaxScore)
IndexReader
public static DirectoryReader Open(Directory directory)
public static DirectoryReader Open(Directory directory, int termInfosIndexDivisor)
public static DirectoryReader Open(IndexWriter writer, bool applyAllDeletes)
public static DirectoryReader Open(IndexCommit commit)
public static DirectoryReader Open(IndexCommit commit, int termInfosIndexDivisor)
public IList<AtomicReaderContext> Leaves
NCacheDirectory
public override string[] ListAll()
public override long FileLength(string name)
public override void DeleteFile(string name)
public override string GetLockID()
public override IndexInput OpenInput(string name, IOContext context)
CompositeReader
public override sealed IndexReaderContext Context
Additional Resources
NCache provides a sample application for Distributed Lucene on GitHub.
See Also
Lucene Components and Overview
Configure Lucene Query Indexes
SQL Search in Cache
Search Cache with LINQ