Namespace Lucene.Net.Analysis
Classes
Analyzer
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
In order to define what analysis is done, subclasses must define their TokenStreamComponents in CreateComponents(String, TextReader). The components are then reused in each call to GetTokenStream(String, TextReader).
Simple example:
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer source = new FooTokenizer(reader);
TokenStream filter = new FooFilter(source);
filter = new BarFilter(filter);
return new TokenStreamComponents(source, filter);
});
For more examples, see the Lucene.Net.Analysis namespace documentation.
For some concrete implementations bundled with Lucene, look in the analysis modules:
- Common: Analyzers for indexing content in different languages and domains.
- ICU: Exposes functionality from ICU to Apache Lucene.
- Kuromoji: Morphological analyzer for Japanese text.
- Morfologik: Dictionary-driven lemmatization for the Polish language.
- Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
- Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
- Stempel: Algorithmic Stemmer for the Polish Language.
- UIMA: Analysis integration with Apache UIMA.
Analyzer.GlobalReuseStrategy
Implementation of ReuseStrategy that reuses the same components for every field.
Analyzer.PerFieldReuseStrategy
Implementation of ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name.
AnalyzerWrapper
Extension to Analyzer suitable for Analyzers which wrap other Analyzers.
GetWrappedAnalyzer(String) allows the Analyzer to wrap multiple Analyzers which are selected on a per field basis.
WrapComponents(String, TokenStreamComponents) allows the TokenStreamComponents of the wrapped Analyzer to then be wrapped (such as adding a new TokenFilter to form new TokenStreamComponents).
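For illustration, a minimal sketch of a subclass (the class name, field names, and the "id" field check are hypothetical) that selects a wrapped Analyzer on a per-field basis, reusing components per field via Analyzer.PER_FIELD_REUSE_STRATEGY:
public sealed class PerFieldAnalyzerSketch : AnalyzerWrapper
{
    private readonly Analyzer defaultAnalyzer;
    private readonly Analyzer idAnalyzer;

    public PerFieldAnalyzerSketch(Analyzer defaultAnalyzer, Analyzer idAnalyzer)
        : base(Analyzer.PER_FIELD_REUSE_STRATEGY) // reuse TokenStreamComponents per field
    {
        this.defaultAnalyzer = defaultAnalyzer;
        this.idAnalyzer = idAnalyzer;
    }

    protected override Analyzer GetWrappedAnalyzer(string fieldName)
    {
        // Select the wrapped Analyzer on a per-field basis.
        return fieldName == "id" ? idAnalyzer : defaultAnalyzer;
    }
}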
BaseTokenStreamTestCase
BaseTokenStreamTestCase.CheckClearAttributesAttribute
Attribute that records if it was cleared or not. This is used for testing that ClearAttributes() was called correctly.
CachingTokenFilter
This class can be used if the token attributes of a TokenStream are intended to be consumed more than once. It caches all token attribute states locally in a List.
CachingTokenFilter implements the optional method Reset(), which repositions the stream to the first Token.
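A minimal usage sketch (the analyzer, field name, and text are placeholder values): the underlying stream is reset once, and Reset() on the filter then replays the cached tokens for a second pass.
TokenStream source = analyzer.GetTokenStream("body", new StringReader("some content"));
source.Reset();                              // reset the underlying stream once
CachingTokenFilter buffer = new CachingTokenFilter(source);
ICharTermAttribute termAtt = buffer.AddAttribute<ICharTermAttribute>();

buffer.Reset();
while (buffer.IncrementToken())              // first pass consumes the input and fills the cache
{
    Console.WriteLine(termAtt.ToString());
}

buffer.Reset();                              // repositions to the first cached token
while (buffer.IncrementToken())              // second pass replays the same tokens from the cache
{
    Console.WriteLine(termAtt.ToString());
}
buffer.End();
buffer.Dispose();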
CannedBinaryTokenStream
TokenStream from a canned list of binary (BytesRef-based) tokens.
CannedBinaryTokenStream.BinaryTermAttribute
Implementation for CannedBinaryTokenStream.IBinaryTermAttribute.
CannedBinaryTokenStream.BinaryToken
Represents a binary token.
CannedTokenStream
TokenStream from a canned list of Tokens.
CharFilter
Subclasses of CharFilter can be chained to filter a TextReader. They can be used as a TextReader with additional offset correction: Tokenizers will automatically use CorrectOffset(Int32) if they detect that the TextReader they consume is a CharFilter.
This class is abstract: at a minimum you must implement Read(Char[], Int32, Int32), transforming the input in some way from m_input, and Correct(Int32) to adjust the offsets to match the originals.
You can optionally provide more efficient implementations of additional methods like Read(), but this is not required.
For examples and integration with Analyzer, see the Lucene.Net.Analysis namespace documentation.
CollationTestBase
Base test class for testing Unicode collation.
LookaheadTokenFilter
An abstract TokenFilter to make it easier to build graph token filters requiring some lookahead. This class handles the details of buffering up tokens, recording them by position, restoring them, providing access to them, etc.
LookaheadTokenFilter.Position
Holds all state for a single position; subclass this to record other state at each position.
LookaheadTokenFilter<T>
MockAnalyzer
MockBytesAnalyzer
Analyzer for testing that encodes terms as UTF-16 bytes.
MockBytesAttributeFactory
Attribute factory that implements CharTermAttribute with MockUTF16TermAttributeImpl.
MockCharFilter
The purpose of this CharFilter is to send offsets out of bounds if the analyzer doesn't use CorrectOffset(Int32) or does incorrect offset math.
MockFixedLengthPayloadFilter
TokenFilter that adds random fixed-length payloads.
MockGraphTokenFilter
Randomly inserts overlapped (posInc=0) tokens with posLength sometimes > 1. The chain must have an OffsetAttribute.
MockHoleInjectingTokenFilter
Randomly injects holes (similar to what a StopFilter would do).
MockPayloadAnalyzer
Wraps a whitespace tokenizer with a filter that sets the first token, and odd tokens to posinc=1, and all others to 0, encoding the position as pos: XXX in the payload.
MockRandomLookaheadTokenFilter
Uses LookaheadTokenFilter to randomly peek at future tokens.
MockReaderWrapper
Wraps a Reader, and can throw random or fixed exceptions, and spoon feed read chars.
MockTokenFilter
MockTokenizer
MockUTF16TermAttributeImpl
Extension of CharTermAttribute that encodes the term text as UTF-16 bytes instead of as UTF-8 bytes.
MockVariableLengthPayloadFilter
TokenFilter that adds random variable-length payloads.
NumericTokenStream
Expert: This class provides a TokenStream for indexing numeric values that can be used by NumericRangeQuery or NumericRangeFilter.
Note that for simple usage, Int32Field, Int64Field, SingleField or DoubleField is recommended. These fields disable norms and term freqs, as they are not usually needed during searching. If you need to change these settings, you should use this class.
Here's an example usage, for an Int32 field:
FieldType fieldType = new FieldType(TextField.TYPE_NOT_STORED)
{
OmitNorms = true,
IndexOptions = IndexOptions.DOCS_ONLY
};
Field field = new Field(name, new NumericTokenStream(precisionStep).SetInt32Value(value), fieldType);
document.Add(field);
For optimal performance, re-use the TokenStream and Field instance for more than one document:
NumericTokenStream stream = new NumericTokenStream(precisionStep);
FieldType fieldType = new FieldType(TextField.TYPE_NOT_STORED)
{
OmitNorms = true,
IndexOptions = IndexOptions.DOCS_ONLY
};
Field field = new Field(name, stream, fieldType);
Document document = new Document();
document.Add(field);
for(all documents)
{
stream.SetInt32Value(value);
writer.AddDocument(document);
}
This stream is not intended to be used in analyzers; it's more for iterating the different precisions during indexing a specific numeric value.
NOTE: as token streams are only consumed once the document is added to the index, if you index more than one numeric field, use a separate NumericTokenStream instance for each.
See NumericRangeQuery for more details on the precisionStep parameter as well as how numeric fields work under the hood.
@since 2.9
NumericTokenStream.NumericTermAttribute
Implementation of NumericTokenStream.INumericTermAttribute. @lucene.internal @since 4.0
ReusableStringReader
Internal class to enable reuse of the string reader by GetTokenStream(String, String).
ReuseStrategy
Strategy defining how TokenStreamComponents are reused per call to GetTokenStream(String, TextReader).
Token
A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.
The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type "eos". The default token type is "word".
A Token can optionally have metadata (a.k.a. payload) in the form of a variable length byte array. Use GetPayload() to retrieve the payloads from the index.
NOTE: As of 2.9, Token implements all IAttribute interfaces that are part of core Lucene and can be found in the Lucene.Net.Analysis.TokenAttributes namespace. Even though it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all IAttributes, which is especially useful to easily switch from the old to the new TokenStream API.
Tokenizers and TokenFilters should try to re-use a Token instance when possible for best performance, by implementing the IncrementToken() API. Failing that, to create a new Token you should first use one of the constructors that start with null text. To load the token from a char[] use CopyBuffer(Char[], Int32, Int32). To load from a string, use SetEmpty() followed by Append(String).
Typical Token reuse patterns:
- Copying text from a string (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(string, startOffset, endOffset[, type]);
- Copying some text from a string (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(string, 0, string.Length, startOffset, endOffset[, type]);
- Copying text from char[] buffer (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(buffer, 0, buffer.Length, startOffset, endOffset[, type]);
- Copying some text from a char[] buffer (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(buffer, start, end - start, startOffset, endOffset[, type]);
- Copying from one Token to another (type is reset to DEFAULT_TYPE if not specified):
return reusableToken.Reinit(source.Buffer, 0, source.Length, source.StartOffset, source.EndOffset[, source.Type]);
A few things to note:
- Clear() initializes all of the fields to default values. This was changed in contrast to Lucene 2.4, but should affect no one.
- Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
- The startOffset and endOffset represent the start and end offsets in the source text, so be careful in adjusting them.
- When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.
Please note: With Lucene 3.1, the ToString() method had to be changed to match the ICharSequence interface introduced by the interface ICharTermAttribute. This method now prints only the term text, with no additional information.
Token.TokenAttributeFactory
Expert: Creates a Token.TokenAttributeFactory returning Token as instance for the basic attributes and for all other attributes calls the given delegate factory. @since 3.0
TokenFilter
A TokenFilter is a TokenStream whose input is another TokenStream.
This is an abstract class; subclasses must override IncrementToken().
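A minimal sketch of a subclass (the class name is hypothetical) that upper-cases each term produced by its input; m_input is the wrapped TokenStream:
public sealed class UpperCaseFilterSketch : TokenFilter
{
    private readonly ICharTermAttribute termAtt;

    public UpperCaseFilterSketch(TokenStream input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken())
        {
            return false;                          // no more tokens from the wrapped stream
        }
        // Replace the term text; all other attributes pass through unchanged.
        string upper = termAtt.ToString().ToUpperInvariant();
        termAtt.SetEmpty().Append(upper);
        return true;
    }
}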
Tokenizer
A Tokenizer is a TokenStream whose input is a TextReader.
This is an abstract class; subclasses must override IncrementToken().
NOTE: Subclasses overriding IncrementToken() must call ClearAttributes() before setting attributes.
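A minimal sketch of a subclass (the class name is hypothetical) that returns the entire input as a single token, calling ClearAttributes() before setting attributes and CorrectOffset(Int32) when recording offsets:
public sealed class WholeInputTokenizerSketch : Tokenizer
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private bool done;

    public WholeInputTokenizerSketch(TextReader input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    public override bool IncrementToken()
    {
        if (done)
        {
            return false;
        }
        ClearAttributes();                     // must be called before setting any attributes
        string text = m_input.ReadToEnd();     // m_input is the TextReader supplied to the Tokenizer
        termAtt.SetEmpty().Append(text);
        offsetAtt.SetOffset(CorrectOffset(0), CorrectOffset(text.Length));
        done = true;
        return text.Length > 0;
    }

    public override void Reset()
    {
        base.Reset();
        done = false;
    }
}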
TokenStream
A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text.
This is an abstract class; concrete subclasses are:
- Tokenizer, a TokenStream whose input is a TextReader; and
- TokenFilter, a TokenStream whose input is another TokenStream.
TokenStream now extends AttributeSource, which provides access to all of the token IAttributes for the TokenStream. Note that only one instance per attribute is created and reused for every token; this approach reduces object creation and allows local caching of references to the attributes.
The workflow of the new TokenStream API is as follows:
- Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
- The consumer calls Reset().
- The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
- The consumer calls IncrementToken() until it returns false, consuming the attributes after each call.
- The consumer calls End() so that any end-of-stream operations can be performed.
- The consumer calls Dispose() to release any resource when finished using the TokenStream.
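A minimal sketch of this consumer workflow (the analyzer, field name, and text are placeholder values):
TokenStream stream = analyzer.GetTokenStream("body", new StringReader("the quick brown fox"));
ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
IOffsetAttribute offsetAtt = stream.AddAttribute<IOffsetAttribute>();
try
{
    stream.Reset();                          // reset before consuming
    while (stream.IncrementToken())          // advance token by token
    {
        Console.WriteLine("{0} [{1}-{2}]", termAtt, offsetAtt.StartOffset, offsetAtt.EndOffset);
    }
    stream.End();                            // perform any end-of-stream operations
}
finally
{
    stream.Dispose();                        // release resources
}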
You can find some example code for the new API in the analysis documentation.
Sometimes it is desirable to capture the current state of a TokenStream, e.g., for buffering purposes (see CachingTokenFilter, TeeSinkTokenFilter). For this use case CaptureState() and RestoreState(AttributeSource.State) can be used.
The TokenStream-API in Lucene is based on the decorator pattern. Therefore all non-abstract subclasses must be sealed or have at least a sealed implementation of IncrementToken()! This is checked when assertions are enabled.
TokenStreamComponents
This class encapsulates the outer components of a token stream. It provides access to the source (Tokenizer) and the outer end (sink), an instance of TokenFilter which also serves as the TokenStream returned by GetTokenStream(String, TextReader).
TokenStreamToAutomaton
Consumes a TokenStream and creates an Automaton where the transition labels are UTF8 bytes (or Unicode code points if unicodeArcs is true) from the ITermToBytesRefAttribute. Between tokens we insert POS_SEP and for holes we insert HOLE.
@lucene.experimental
TokenStreamToDot
Consumes a TokenStream and outputs the dot (graphviz) string (graph).
ValidatingTokenFilter
A TokenFilter that checks consistency of the tokens (e.g., offsets are consistent with one another).
VocabularyAssert
Utility class for doing vocabulary-based stemming tests.
Interfaces
BaseTokenStreamTestCase.ICheckClearAttributesAttribute
Attribute that records if it was cleared or not. This is used for testing that ClearAttributes() was called correctly.
CannedBinaryTokenStream.IBinaryTermAttribute
An attribute extending ITermToBytesRefAttribute that allows the binary (BytesRef) value of the term to be set.
NumericTokenStream.INumericTermAttribute
Expert: Use this attribute to get the details of the currently generated token. @lucene.experimental @since 4.0