Class Analyzer
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
In order to define what analysis is done, subclasses must define their TokenStreamComponents in CreateComponents(String, TextReader). The components are then reused in each call to GetTokenStream(String, TextReader).
Simple example:
```csharp
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
    Tokenizer source = new FooTokenizer(reader);
    TokenStream filter = new FooFilter(source);
    filter = new BarFilter(filter);
    return new TokenStreamComponents(source, filter);
});
```
For more examples, see the Lucene.Net.Analysis namespace documentation.
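The same policy can also be defined by subclassing Analyzer and overriding CreateComponents directly. A minimal sketch, assuming StandardTokenizer and LowerCaseFilter from the Lucene.Net.Analysis.Common package (the exact accessibility modifiers on CreateComponents can vary between Lucene.NET versions):

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public sealed class MyAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // The Tokenizer is the source of the chain; each TokenFilter
        // wraps the previous stream, and the outermost filter is the sink.
        Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream filter = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
        return new TokenStreamComponents(source, filter);
    }
}
```

Because the returned components are cached and reset between calls, CreateComponents is typically invoked only once per field (or once per thread, depending on the ReuseStrategy).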
For some concrete implementations bundled with Lucene, look in the analysis modules:
- Common: Analyzers for indexing content in different languages and domains.
- ICU: Exposes functionality from ICU to Apache Lucene.
- Kuromoji: Morphological analyzer for Japanese text.
- Morfologik: Dictionary-driven lemmatization for the Polish language.
- Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
- Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
- Stempel: Algorithmic Stemmer for the Polish Language.
- UIMA: Analysis integration with Apache UIMA.
Inheritance
System.Object → Analyzer

Assembly: DistributedLucene.Net.dll
Syntax
public abstract class Analyzer : IDisposable
Constructors
Name | Description |
---|---|
Analyzer() | Create a new Analyzer, reusing the same set of components per-thread across calls to GetTokenStream(String, TextReader). |
Analyzer(ReuseStrategy) | Expert: create a new Analyzer with a custom ReuseStrategy. NOTE: if you just want to reuse on a per-field basis, it's easier to use a subclass of AnalyzerWrapper such as PerFieldAnalyzerWrapper instead. |
Fields
Name | Description |
---|---|
GLOBAL_REUSE_STRATEGY | A predefined ReuseStrategy that reuses the same components for every field. |
PER_FIELD_REUSE_STRATEGY | A predefined ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name. |
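Either predefined strategy can be passed to the NewAnonymous overloads that accept a ReuseStrategy. A sketch, assuming KeywordTokenizer from the Lucene.Net.Analysis.Common package and that the strategy parameter is named reuseStrategy:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;

Analyzer analyzer = Analyzer.NewAnonymous(
    createComponents: (fieldName, reader) =>
    {
        // KeywordTokenizer emits the entire input as a single token.
        Tokenizer source = new KeywordTokenizer(reader);
        return new TokenStreamComponents(source);
    },
    reuseStrategy: Analyzer.PER_FIELD_REUSE_STRATEGY);
```

With PER_FIELD_REUSE_STRATEGY, a separate TokenStreamComponents instance is cached per field name, so streams for different fields can be consumed without clobbering one another.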
Properties
Name | Description |
---|---|
Strategy | Gets the ReuseStrategy used by this analyzer. |
Methods
Name | Description |
---|---|
CreateComponents(String, TextReader) | Creates a new TokenStreamComponents instance for this analyzer. |
Dispose() | Frees persistent resources used by this Analyzer. |
GetObjectData(SerializationInfo, StreamingContext) | Populates the SerializationInfo with the data needed to serialize this Analyzer (ISerializable support). |
GetOffsetGap(String) | Just like GetPositionIncrementGap(String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing. |
GetPositionIncrementGap(String) | Invoked before indexing an IIndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IIndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field, including across IIndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IIndexableField instance boundaries. |
GetTokenStream(String, String) | Returns a TokenStream suitable for fieldName, tokenizing the contents of text. This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader). NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this. |
GetTokenStream(String, TextReader) | Returns a TokenStream suitable for fieldName, tokenizing the contents of reader. This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader). NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this. |
InitReader(String, TextReader) | Override this if you want to add a CharFilter chain. The default implementation returns reader unmodified. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate. LUCENENET specific. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate and the body of InitReader(String, TextReader) through the initReader delegate. LUCENENET specific. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>, ReuseStrategy) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate and the body of InitReader(String, TextReader) through the initReader delegate, using a custom ReuseStrategy. LUCENENET specific. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, ReuseStrategy) | Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents delegate, using a custom ReuseStrategy. LUCENENET specific. |