Namespace Lucene.Net.Analysis
Classes
Analyzer
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
In order to define what analysis is done, subclasses must define their TokenStreamComponents in CreateComponents(String, TextReader). The components are then reused in each call to GetTokenStream(String, TextReader).
Simple example:
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer source = new FooTokenizer(reader);
TokenStream filter = new FooFilter(source);
filter = new BarFilter(filter);
return new TokenStreamComponents(source, filter);
});
For more examples, see the Lucene.Net.Analysis namespace documentation.
For some concrete implementations bundled with Lucene, look in the analysis modules:
- Common: Analyzers for indexing content in different languages and domains.
- ICU: Exposes functionality from ICU to Apache Lucene.
- Kuromoji: Morphological analyzer for Japanese text.
- Morfologik: Dictionary-driven lemmatization for the Polish language.
- Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
- Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
- Stempel: Algorithmic Stemmer for the Polish Language.
- UIMA: Analysis integration with Apache UIMA.
Analyzer.GlobalReuseStrategy
Implementation of ReuseStrategy that reuses the same components for every field.
Analyzer.PerFieldReuseStrategy
Implementation of ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name.
AnalyzerWrapper
Extension to Analyzer suitable for Analyzers which wrap other Analyzers.
GetWrappedAnalyzer(String) allows the Analyzer to wrap multiple Analyzers which are selected on a per field basis.
WrapComponents(String, TokenStreamComponents) allows the TokenStreamComponents of the wrapped Analyzer to then be wrapped (such as adding a new TokenFilter to form new TokenStreamComponents).
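For illustration, a minimal sketch of a subclass (the class name, field names, and the "id" field check are hypothetical) that selects a wrapped Analyzer on a per-field basis, reusing components per field via Analyzer.PER_FIELD_REUSE_STRATEGY:
public sealed class PerFieldAnalyzerSketch : AnalyzerWrapper
{
    private readonly Analyzer defaultAnalyzer;
    private readonly Analyzer idAnalyzer;

    public PerFieldAnalyzerSketch(Analyzer defaultAnalyzer, Analyzer idAnalyzer)
        : base(Analyzer.PER_FIELD_REUSE_STRATEGY) // reuse TokenStreamComponents per field
    {
        this.defaultAnalyzer = defaultAnalyzer;
        this.idAnalyzer = idAnalyzer;
    }

    protected override Analyzer GetWrappedAnalyzer(string fieldName)
    {
        // Select the wrapped Analyzer on a per-field basis.
        return fieldName == "id" ? idAnalyzer : defaultAnalyzer;
    }
}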
BaseTokenStreamTestCase
BaseTokenStreamTestCase.CheckClearAttributesAttribute
Attribute that records if it was cleared or not. This is used for testing that ClearAttributes() was called correctly.
CachingTokenFilter
This class can be used if the token attributes of a TokenStream are intended to be consumed more than once. It caches all token attribute states locally in a List.
CachingTokenFilter implements the optional method Reset(), which repositions the stream to the first Token.
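A minimal usage sketch (the analyzer, field name, and text are placeholder values): the underlying stream is reset once, and Reset() on the filter then replays the cached tokens for a second pass.
TokenStream source = analyzer.GetTokenStream("body", new StringReader("some content"));
source.Reset();                              // reset the underlying stream once
CachingTokenFilter buffer = new CachingTokenFilter(source);
ICharTermAttribute termAtt = buffer.AddAttribute<ICharTermAttribute>();

buffer.Reset();
while (buffer.IncrementToken())              // first pass consumes the input and fills the cache
{
    Console.WriteLine(termAtt.ToString());
}

buffer.Reset();                              // repositions to the first cached token
while (buffer.IncrementToken())              // second pass replays the same tokens from the cache
{
    Console.WriteLine(termAtt.ToString());
}
buffer.End();
buffer.Dispose();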
CannedBinaryTokenStream
TokenStream from a canned list of binary (BytesRef-based) tokens.
CannedBinaryTokenStream.BinaryTermAttribute
Implementation for CannedBinaryTokenStream.IBinaryTermAttribute.
CannedBinaryTokenStream.BinaryToken
Represents a binary token.
CannedTokenStream
TokenStream from a canned list of Tokens.
CharFilter
Subclasses of CharFilter can be chained to filter a TextReader. They can be used as a TextReader with additional offset correction: Tokenizers will automatically use CorrectOffset(Int32) if they detect that the TextReader they consume is a CharFilter.
This class is abstract: at a minimum you must implement Read(Char[], Int32, Int32), transforming the input in some way from m_input, and Correct(Int32) to adjust the offsets to match the originals.
You can optionally provide more efficient implementations of additional methods like Read(), but this is not required.
For examples and integration with Analyzer, see the Lucene.Net.Analysis namespace documentation.
CollationTestBase
Base test class for testing Unicode collation.
LookaheadTokenFilter
An abstract TokenFilter to make it easier to build graph token filters requiring some lookahead. This class handles the details of buffering up tokens, recording them by position, restoring them, providing access to them, etc.
LookaheadTokenFilter.Position
Holds all state for a single position; subclass this to record other state at each position.
LookaheadTokenFilter<T>
MockAnalyzer
MockBytesAnalyzer
Analyzer for testing that encodes terms as UTF-16 bytes.
MockBytesAttributeFactory
Attribute factory that implements CharTermAttribute with MockUTF16TermAttributeImpl.
MockCharFilter
The purpose of this CharFilter is to send offsets out of bounds if the analyzer doesn't use CorrectOffset(Int32) or does incorrect offset math.
MockFixedLengthPayloadFilter
TokenFilter that adds random fixed-length payloads.
MockGraphTokenFilter
Randomly inserts overlapped (posInc=0) tokens with posLength sometimes > 1. The chain must have an OffsetAttribute.
MockHoleInjectingTokenFilter
Randomly injects holes (similar to what a StopFilter would do).
MockPayloadAnalyzer
Wraps a whitespace tokenizer with a filter that sets the first token, and odd tokens to posinc=1, and all others to 0, encoding the position as pos: XXX in the payload.
MockRandomLookaheadTokenFilter
Uses LookaheadTokenFilter to randomly peek at future tokens.
MockReaderWrapper
Wraps a Reader, and can throw random or fixed exceptions, and spoon feed read chars.
MockTokenFilter
MockTokenizer
MockUTF16TermAttributeImpl
Extension of CharTermAttribute that encodes the term text as UTF-16 bytes instead of as UTF-8 bytes.
MockVariableLengthPayloadFilter
TokenFilter that adds random variable-length payloads.
NumericTokenStream
Expert: This class provides a TokenStream for indexing numeric values that can be used by NumericRangeQuery or NumericRangeFilter.
Note that for simple usage, Int32Field, Int64Field, SingleField or DoubleField is recommended. These fields disable norms and term freqs, as they are not usually needed during searching. If you need to change these settings, you should use this class.
Here's an example usage, for an Int32 field:
FieldType fieldType = new FieldType(TextField.TYPE_NOT_STORED)
{
OmitNorms = true,
IndexOptions = IndexOptions.DOCS_ONLY
};
Field field = new Field(name, new NumericTokenStream(precisionStep).SetInt32Value(value), fieldType);
document.Add(field);
For optimal performance, re-use the TokenStream and Field instance for more than one document:
NumericTokenStream stream = new NumericTokenStream(precisionStep);
FieldType fieldType = new FieldType(TextField.TYPE_NOT_STORED)
{
OmitNorms = true,
IndexOptions = IndexOptions.DOCS_ONLY
};
Field field = new Field(name, stream, fieldType);
Document document = new Document();
document.Add(field);
for(all documents)
{
stream.SetInt32Value(value);
writer.AddDocument(document);
}
This stream is not intended to be used in analyzers; it's more for iterating the different precisions during indexing a specific numeric value.
NOTE: as token streams are only consumed once the document is added to the index, if you index more than one numeric field, use a separate NumericTokenStream instance for each.
See NumericRangeQuery for more details on the precisionStep parameter as well as how numeric fields work under the hood.
@since 2.9
NumericTokenStream.NumericTermAttribute
Implementation of NumericTokenStream.INumericTermAttribute. @lucene.internal @since 4.0
ReusableStringReader
Internal class to enable reuse of the string reader by GetTokenStream(String, String).
ReuseStrategy
Strategy defining how TokenStreamComponents are reused per call to GetTokenStream(String, TextReader).
Token
A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.
The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type "eos". The default token type is "word".
A Token can optionally have metadata (a.k.a. payload) in the form of a variable length byte array. Use GetPayload() to retrieve the payloads from the index.
NOTE: As of 2.9, Token implements all IAttribute interfaces that are part of core Lucene and can be found in the Lucene.Net.Analysis.TokenAttributes namespace. Even though it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all IAttributes, which is especially useful to easily switch from the old to the new TokenStream API.
Tokenizers and TokenFilters should try to re-use a Token instance when possible for best performance, by implementing the IncrementToken() API. Failing that, to create a new Token you should first use one of the constructors that start with null text. To load the token from a char[] use CopyBuffer(Char[], Int32, Int32). To load from a string, use SetEmpty() followed by Append(String).
Typical Token reuse patterns:
- Copying text from a string (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(string, startOffset, endOffset[, type]);
- Copying some text from a string (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(string, 0, string.Length, startOffset, endOffset[, type]);
- Copying text from char[] buffer (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(buffer, 0, buffer.Length, startOffset, endOffset[, type]);
- Copying some text from a char[] buffer (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(buffer, start, end - start, startOffset, endOffset[, type]);
- Copying from one Token to another (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(source.Buffer, 0, source.Length, source.StartOffset, source.EndOffset[, source.Type]);
A few things to note:
- Clear() initializes all of the fields to default values. This was changed in contrast to Lucene 2.4, but should affect no one.
- Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
- The startOffset and endOffset represent the start and end offsets in the source text, so be careful in adjusting them.
- When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.
Please note: With Lucene 3.1, the ToString() method had to be changed to match the ICharSequence interface introduced by the interface ICharTermAttribute. This method now prints only the term text, with no additional information.
Token.TokenAttributeFactory
Expert: Creates a Token.TokenAttributeFactory returning Token as instance for the basic attributes and for all other attributes calls the given delegate factory. @since 3.0
TokenFilter
A TokenFilter is a TokenStream whose input is another TokenStream.
This is an abstract class; subclasses must override IncrementToken().
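A minimal sketch of a subclass (the class name is hypothetical) that upper-cases each term produced by its input; m_input is the wrapped TokenStream:
public sealed class UpperCaseFilterSketch : TokenFilter
{
    private readonly ICharTermAttribute termAtt;

    public UpperCaseFilterSketch(TokenStream input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken())
        {
            return false;                          // no more tokens from the wrapped stream
        }
        // Replace the term text; all other attributes pass through unchanged.
        string upper = termAtt.ToString().ToUpperInvariant();
        termAtt.SetEmpty().Append(upper);
        return true;
    }
}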
Tokenizer
A Tokenizer is a TokenStream whose input is a TextReader.
This is an abstract class; subclasses must override IncrementToken().
NOTE: Subclasses overriding IncrementToken() must call ClearAttributes() before setting attributes.
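A minimal sketch of a subclass (the class name is hypothetical) that returns the entire input as a single token, calling ClearAttributes() before setting attributes and CorrectOffset(Int32) when recording offsets:
public sealed class WholeInputTokenizerSketch : Tokenizer
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private bool done;

    public WholeInputTokenizerSketch(TextReader input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    public override bool IncrementToken()
    {
        if (done)
        {
            return false;
        }
        ClearAttributes();                     // must be called before setting any attributes
        string text = m_input.ReadToEnd();     // m_input is the TextReader supplied to the Tokenizer
        termAtt.SetEmpty().Append(text);
        offsetAtt.SetOffset(CorrectOffset(0), CorrectOffset(text.Length));
        done = true;
        return text.Length > 0;
    }

    public override void Reset()
    {
        base.Reset();
        done = false;
    }
}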
TokenStream
A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text.
This is an abstract class; concrete subclasses are:
- Tokenizer, a TokenStream whose input is a TextReader; and
- TokenFilter, a TokenStream whose input is another TokenStream.
TokenStream now extends AttributeSource, which provides access to all of the token IAttributes for the TokenStream. Note that only one instance per attribute is created and reused for every token; this approach reduces object creation and allows local caching of references to the attributes.
The workflow of the new TokenStream API is as follows:
- Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
- The consumer calls Reset().
- The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
- The consumer calls IncrementToken() until it returns false, consuming the attributes after each call.
- The consumer calls End() so that any end-of-stream operations can be performed.
- The consumer calls Dispose() to release any resource when finished using the TokenStream.
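A minimal sketch of this consumer workflow (the analyzer, field name, and text are placeholder values):
TokenStream stream = analyzer.GetTokenStream("body", new StringReader("the quick brown fox"));
ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
IOffsetAttribute offsetAtt = stream.AddAttribute<IOffsetAttribute>();
try
{
    stream.Reset();                          // reset before consuming
    while (stream.IncrementToken())          // advance token by token
    {
        Console.WriteLine("{0} [{1}-{2}]", termAtt, offsetAtt.StartOffset, offsetAtt.EndOffset);
    }
    stream.End();                            // perform any end-of-stream operations
}
finally
{
    stream.Dispose();                        // release resources
}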
You can find some example code for the new API in the analysis documentation.
Sometimes it is desirable to capture the current state of a TokenStream, e.g., for buffering purposes (see CachingTokenFilter, TeeSinkTokenFilter). For this use case CaptureState() and RestoreState(AttributeSource.State) can be used.
The TokenStream-API in Lucene is based on the decorator pattern. Therefore all non-abstract subclasses must be sealed or have at least a sealed implementation of IncrementToken()! This is checked when assertions are enabled.
TokenStreamComponents
This class encapsulates the outer components of a token stream. It provides access to the source (Tokenizer) and the outer end (sink), an instance of TokenFilter which also serves as the TokenStream returned by GetTokenStream(String, TextReader).
TokenStreamToAutomaton
Consumes a TokenStream and creates an Automaton where the transition labels are UTF8 bytes (or Unicode code points if unicodeArcs is true) from the ITermToBytesRefAttribute. Between tokens we insert POS_SEP and for holes we insert HOLE.
@lucene.experimental
TokenStreamToDot
Consumes a TokenStream and outputs the dot (graphviz) string (graph).
ValidatingTokenFilter
A TokenFilter that checks consistency of the tokens (e.g., offsets are consistent with one another).
VocabularyAssert
Utility class for doing vocabulary-based stemming tests.
Interfaces
BaseTokenStreamTestCase.ICheckClearAttributesAttribute
Attribute that records if it was cleared or not. This is used for testing that ClearAttributes() was called correctly.
CannedBinaryTokenStream.IBinaryTermAttribute
An attribute extending ITermToBytesRefAttribute that allows the binary (BytesRef) value of the term to be set.
NumericTokenStream.INumericTermAttribute
Expert: Use this attribute to get the details of the currently generated token. @lucene.experimental @since 4.0