Class Analyzer
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
In order to define what analysis is done, subclasses must define their TokenStreamComponents in CreateComponents(String, TextReader). The components are then reused in each call to GetTokenStream(String, TextReader).
Simple example:
```csharp
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
    Tokenizer source = new FooTokenizer(reader);
    TokenStream filter = new FooFilter(source);
    filter = new BarFilter(filter);
    return new TokenStreamComponents(source, filter);
});
```
For more examples, see the Lucene.Net.Analysis namespace documentation.
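The same policy can also be defined by subclassing Analyzer and overriding CreateComponents directly. A minimal sketch, assuming StandardTokenizer and LowerCaseFilter from the Lucene.Net.Analysis.Common package (the exact accessibility modifiers on CreateComponents can vary between Lucene.NET versions):

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public sealed class MyAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // The Tokenizer is the source of the chain; each TokenFilter
        // wraps the previous stream, and the outermost filter is the sink.
        Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream filter = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
        return new TokenStreamComponents(source, filter);
    }
}
```

Because the returned components are cached and reset between calls, CreateComponents is typically invoked only once per field (or once per thread, depending on the ReuseStrategy).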
For some concrete implementations bundled with Lucene, look in the analysis modules:
- Common: Analyzers for indexing content in different languages and domains.
- ICU: Exposes functionality from ICU to Apache Lucene.
- Kuromoji: Morphological analyzer for Japanese text.
- Morfologik: Dictionary-driven lemmatization for the Polish language.
- Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
- Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
- Stempel: Algorithmic Stemmer for the Polish Language.
- UIMA: Analysis integration with Apache UIMA.
Inheritance
System.Object → Analyzer

Assembly: DistributedLucene.Net.dll
Syntax
public abstract class Analyzer : IDisposable
Constructors
Name | Description |
---|---|
Analyzer() | Create a new Analyzer, reusing the same set of components per-thread across calls to GetTokenStream(String, TextReader). |
Analyzer(ReuseStrategy) | Expert: create a new Analyzer with a custom ReuseStrategy. NOTE: if you just want to reuse on a per-field basis, it's easier to use a subclass of AnalyzerWrapper such as PerFieldAnalyzerWrapper instead. |
Fields
Name | Description |
---|---|
GLOBAL_REUSE_STRATEGY | A predefined ReuseStrategy that reuses the same components for every field. |
PER_FIELD_REUSE_STRATEGY | A predefined ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name. |
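Either predefined strategy can be passed to the NewAnonymous overloads that accept a ReuseStrategy. A sketch, assuming KeywordTokenizer from the Lucene.Net.Analysis.Common package and that the strategy parameter is named reuseStrategy:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;

Analyzer analyzer = Analyzer.NewAnonymous(
    createComponents: (fieldName, reader) =>
    {
        // KeywordTokenizer emits the entire input as a single token.
        Tokenizer source = new KeywordTokenizer(reader);
        return new TokenStreamComponents(source);
    },
    reuseStrategy: Analyzer.PER_FIELD_REUSE_STRATEGY);
```

With PER_FIELD_REUSE_STRATEGY, a separate TokenStreamComponents instance is cached per field name, so streams for different fields can be consumed without clobbering one another.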
Properties
Name | Description |
---|---|
Strategy | Gets the ReuseStrategy used by this analyzer. |
Methods
Name | Description |
---|---|
CreateComponents(String, TextReader) | Creates a new TokenStreamComponents instance for this analyzer. |
Dispose() | Frees persistent resources used by this Analyzer. |
GetObjectData(SerializationInfo, StreamingContext) | Populates the SerializationInfo with the data needed to serialize this Analyzer (ISerializable support). |
GetOffsetGap(String) | Just like GetPositionIncrementGap(String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing. |
GetPositionIncrementGap(String) | Invoked before indexing an IIndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IIndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field, including across IIndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IIndexableField instance boundaries. |
GetTokenStream(String, String) | Returns a TokenStream suitable for fieldName, tokenizing the contents of text. This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader). NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this. |
GetTokenStream(String, TextReader) | Returns a TokenStream suitable for fieldName, tokenizing the contents of reader. This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader). NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this. |
InitReader(String, TextReader) | Override this if you want to add a CharFilter chain. The default implementation returns reader unmodified. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate. LUCENENET specific. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate and the body of InitReader(String, TextReader) through the initReader delegate. LUCENENET specific. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>, ReuseStrategy) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate and the body of InitReader(String, TextReader) through the initReader delegate, using a custom ReuseStrategy. LUCENENET specific. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, ReuseStrategy) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate, using a custom ReuseStrategy. LUCENENET specific. |