Lucene is an efficient and powerful search engine library that supports full-text searching in .NET. It makes text searching easy and provides a rich set of APIs for building fast, user-friendly search features.
However, one thing Lucene does not offer is scalability. A Lucene application typically writes its index to files on a single machine's disk, which limits capacity and leaves you with a single point of failure. NCache provides a Distributed Lucene feature that makes your Lucene applications scalable and distributed, giving you a linearly scalable alternative without that single point of failure.
For more information on Distributed Lucene, how it works, and why it should be your go-to option, have a look at Distributed Lucene: Full-Text Searching in .NET for Scalability. Before getting into the details of Distributed Lucene, let us skim through a Distributed Lucene solution demonstrating the basic workflow.
The code sample below shows a Distributed Lucene application that performs three main steps:
- Initialize the NCache Directory.
- Index the documents for performing a search on them.
- Perform a search on the indexed documents.
```csharp
try
{
    // Specify the cache name that is used for Lucene
    string cache = "LuceneCache";

    // Specify the index name to create the indexes
    string indexName = "ProductIndex";

    // Create a directory and open it on the cache and the index path
    Directory directory = NCacheDirectory.Open(cache, indexName);

    // Specify the analyzer used to analyze data
    Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

    // Create an IndexWriterConfig which holds the configuration for the writer
    IndexWriterConfig config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

    // Create the IndexWriter with the directory and the configuration
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // Add the product information that is to be indexed
    Product[] products = FetchProductsFromDB();

    foreach (var prod in products)
    {
        // Create a document and add fields to it
        Document doc = new Document();
        doc.Add(new TextField("id", prod.ProductID, Field.Store.YES));
        doc.Add(new TextField("name", prod.ProductName, Field.Store.NO));
        doc.Add(new TextField("category", prod.Category, Field.Store.YES));
        doc.Add(new TextField("description", prod.Description, Field.Store.YES));

        // Writer is created previously
        indexWriter.AddDocument(doc);

        // Call 'Commit' to save the changes
        indexWriter.Commit();
    }

    // Open a new reader instance on the writer
    IndexReader reader = indexWriter.GetReader(true);

    // A searcher is opened to perform searching
    IndexSearcher indexSearcher = new IndexSearcher(reader);

    // Specify the search term and the field name (matching the indexed field)
    string searchTerm = "Beverages";
    string fieldName = "category";

    LuceneVersion version = LuceneVersion.LUCENE_48;

    // Create a query parser and parse the query with it,
    // using the same standard analyzer that was used for indexing
    QueryParser parser = new QueryParser(version, fieldName, analyzer);
    Query query = parser.Parse(searchTerm);

    // Returns the top 10 hits from the result set
    ScoreDoc[] docsFound = indexSearcher.Search(query, 10).ScoreDocs;

    // Closes all the files associated with this index
    reader.Dispose();
}
catch (Exception ex)
{
    // Handle Lucene exceptions
}
```
In this blog, we focus on the detailed steps and working of Distributed Lucene for full-text searching.
How to Migrate to Distributed Lucene?
Distributed Lucene works just as Lucene does. One major convenience of Distributed Lucene is that it gives you the same API as Lucene. As a Lucene user, you get the scalability you wish for with just a one-line code change: use the NCache Directory and your application is good to go. The few behavioral and API differences in Distributed Lucene are listed in the documentation here.
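As a minimal sketch of that one-line change (the cache name and index path here are illustrative), an application previously using FSDirectory only needs its directory line swapped:

```csharp
// Before: a standard Lucene.NET application keeps its index on the local disk
// Directory directory = FSDirectory.Open(new DirectoryInfo("lucene-index-path"));

// After: the same application, with only this one line changed, stores its
// index in the distributed cache through the NCache Directory
Directory directory = NCacheDirectory.Open("luceneCache", new DirectoryInfo("lucene-index-path"));

// Everything that follows (writers, readers, searchers) stays unchanged
```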
Let us take a closer look at these steps from a technical aspect.
Step 1: Connecting to NCache Directory
Let us start by introducing NCache into your .NET application using Lucene. The primary step is to replace the Lucene.NET NuGet package in your project with NCache's NuGet package. NCacheDirectory, as the name suggests, is an implementation of Lucene's Directory abstraction that stores the indexes in the distributed cache, making them scalable. Lucene.NET ships multiple Directory implementations for this purpose, the most commonly used being FSDirectory, which keeps indexes on the local file system. So, the first step is to establish a connection with the NCache Directory.
Below is the code that connects you to a cache called "luceneCache" and opens the provided directory on all existing servers, provided the directory already exists. Otherwise, it creates a new directory with the provided name.
```csharp
var indexDirectory = NCacheDirectory.Open("luceneCache", new DirectoryInfo("lucene-index-path"));
```
Along with this, NCache requires one more small step to use Distributed Lucene: using any management tool, enable the Lucene index on the cache. Please refer to this chapter in the NCache documentation for guidelines.
Step 2: Creating Distributed Lucene Indexes with NCache
Now that NCache is incorporated into your Lucene application comes the data-writing phase. As discussed earlier, Lucene stores your records as documents in inverted indexes. With Distributed Lucene, these indexes are partitioned across the cache nodes, providing linear scalability, and the distribution of documents among the nodes is handled automatically by NCache.
Documents are made up of fields, which are key-value pairs. Each field contains the text that is to be made searchable. The other arguments of a field's constructor carry instructions for handling that individual field, such as whether its original value should be stored.
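As a minimal sketch (the field names and values here are illustrative), a document for a product record could be built like this:

```csharp
// Build a Lucene document for one product record
Document doc = new Document();

// TextField values are analyzed and made searchable; Field.Store controls
// whether the original value is also stored and returned with search results
doc.Add(new TextField("id", "1021", Field.Store.YES));   // searchable, value stored
doc.Add(new TextField("name", "Chai", Field.Store.NO));  // searchable, value not stored

// StringField is indexed as a single token, without analysis (exact matches)
doc.Add(new StringField("category", "Beverages", Field.Store.YES));
```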
These documents are stored in the NCache Directory in the form of Lucene inverted indexes. Lucene breaks the text down into tokens; this process of breaking down text for indexing is done by analyzers. There are multiple types of analyzers, such as the whitespace analyzer and the standard analyzer. Tokenization speeds up searching, since analyzers remove noise words and index only the remaining data on which searches are performed. For complete coverage of analyzers, please refer to this page.
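To see the difference an analyzer makes, consider the same phrase tokenized by the two analyzers mentioned above (the expected tokens in the comments reflect each analyzer's default behavior):

```csharp
// The choice of analyzer decides how text becomes searchable tokens.
// For the input "The Quick-Brown Fox":
//   WhitespaceAnalyzer -> [The] [Quick-Brown] [Fox]
//     (splits on whitespace only, keeps case and punctuation)
//   StandardAnalyzer   -> [quick] [brown] [fox]
//     (splits on punctuation, lowercases, drops stop words like "the")
Analyzer whitespace = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
Analyzer standard = new StandardAnalyzer(LuceneVersion.LUCENE_48);
```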
Documents are indexed in the directory using IndexWriter, which is responsible for performing all write operations on the cache. You can also set additional field properties, such as the field store, to make searching efficient. The code below demonstrates writing documents in bulk to the cache, which creates the indexes on all cache servers. Make sure to call Commit after your write operations; otherwise, the changes are not visible to searches.
```csharp
public int AddDocuments()
{
    var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
    var writerConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
    var indexWriter = new IndexWriter(indexDirectory, writerConfig);

    var docs = new List<Document>();
    var docCount = 0;

    // Fetching repository info from the data source
    // (dataProvider, BULK_SIZE, and indexDirectory are class members)
    using (var enumerator = dataProvider.GetProductFTSEnumerator())
    {
        while (enumerator.MoveNext())
        {
            docs.Add(enumerator.Current.GetLuceneDocument());
            docCount++;

            // Write and commit one bulk of documents at a time
            if (docs.Count == BULK_SIZE)
            {
                indexWriter.AddDocuments(docs);

                // Flush and make the documents ready for search
                indexWriter.Commit();
                docs.Clear(); // Remove the added documents
            }
        }
    }

    // Write and commit any remaining documents of the last, partial bulk
    if (docs.Count > 0)
    {
        indexWriter.AddDocuments(docs);
        indexWriter.Commit();
    }

    // Return the count of added documents
    return docCount;
}
```
Step 3: Full Text Searching the Data with Scalability
This step covers the actual task of searching data in the cache using Distributed Lucene. As with indexing, reads are performed on all the nodes of the cache and the search results are merged. This scales reads as well and makes searches faster.
Similar to IndexWriter, there is an IndexSearcher that does all the heavy lifting for read operations. The query parser used alongside the searcher takes an analyzer, and it is highly recommended to use the same analyzer for searching that was used for writing the data; otherwise, the results can be inconsistent.
The next thing Lucene uses to perform searches is queries. Lucene offers a wide range of built-in query classes, along with a Lucene-specific query syntax. These queries make your search efficient; for example, a wildcard query performs a search using wildcards and renders the matching results. Other query types include TermQuery, BooleanQuery, and SpanQuery.
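A few of these query types, sketched against the product fields used in the earlier example (field names and terms are illustrative):

```csharp
// TermQuery: matches documents whose "category" field contains the exact term
Query termQuery = new TermQuery(new Term("category", "beverages"));

// WildcardQuery: matches any term in "name" starting with "cha"
Query wildcardQuery = new WildcardQuery(new Term("name", "cha*"));

// BooleanQuery: combines clauses; here both must match (logical AND)
var booleanQuery = new BooleanQuery();
booleanQuery.Add(termQuery, Occur.MUST);
booleanQuery.Add(wildcardQuery, Occur.MUST);
```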
The search results are returned in the form of hits, a list of documents matching the search. These hits can then be iterated to get the actual documents matching the query.
The code below shows searching performed on the Lucene indexes. IndexSearcher uses an IndexReader to fetch results, and a query performs the actual search on the indexed data. A QueryParser parses the user-provided textual query according to the analyzer and the search terms, and the results are returned in the form of TopDocs.
In the example given below, a QueryWrapperFilter is applied on the Products’ category and the search results are sorted by relevance. The query applies fuzziness to all the terms to provide optimal results.
```csharp
public Tuple<long, List<ProductFTS>> Search(string searchTerm, int top = 100, string category = null)
{
    long totalHits = 0;
    var repoList = new List<ProductFTS>();

    // Create the IndexSearcher instance to search the data
    // (directoryReader is the IndexReader instance opened on the IndexWriter;
    //  Version, Fields, analyzer, and CATEGORY_FIELD are class members)
    var searcher = new IndexSearcher(directoryReader);
    try
    {
        TopDocs topDocs = null;
        var queryParser = new MultiFieldQueryParser(Version, Fields, analyzer);
        var query = new BooleanQuery();

        // Split the search term into multiple words to build a fuzzy query
        string[] terms = searchTerm.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string term in terms)
            // Remove any existing '~' before appending one, to avoid duplicates
            query.Add(queryParser.Parse(term.Replace("~", "") + "~"), Occur.MUST);

        if (!string.IsNullOrEmpty(category))
        {
            var filter = new QueryWrapperFilter(new TermQuery(new Term(CATEGORY_FIELD, category)));

            // Perform the search with the category filter applied
            topDocs = searcher.Search(query, filter, top, Sort.RELEVANCE);
        }
        else
        {
            topDocs = searcher.Search(query, top, Sort.RELEVANCE);
        }
        totalHits = topDocs.TotalHits;
        repoList = GetSearchedDocs(searcher, topDocs);
    }
    catch (Exception ex)
    {
        // Handle Lucene exceptions
    }
    return new Tuple<long, List<ProductFTS>>(totalHits, repoList);
}
```
You can then implement your business logic to iterate the search results from the hits.
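A minimal sketch of that iteration, assuming topDocs holds the search results and searcher is the IndexSearcher from the example above:

```csharp
// Iterate the hits and fetch each matching document from the searcher
foreach (ScoreDoc scoreDoc in topDocs.ScoreDocs)
{
    // Retrieve the stored fields of the matching document
    Document doc = searcher.Doc(scoreDoc.Doc);

    // Only fields indexed with Field.Store.YES are available here
    Console.WriteLine($"{doc.Get("id")} : {doc.Get("category")} (score {scoreDoc.Score})");
}
```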
Conclusion
Lucene is a highly efficient search engine for performing full-text searching on your data, but it lacks scalability. NCache can be used with Lucene to make it scalable with very little effort. A scalable Distributed Lucene deployment not only makes your application faster but also helps you deal with the major setback of a single point of failure. Since NCache can be plugged into your .NET application with a single-line code change, consider it a strong option for your scalable Lucene application.