Class StandardTokenizer
A grammar-based tokenizer constructed with JFlex.
As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
You must specify the required LuceneVersion compatibility when creating StandardTokenizer:
- As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility.
- As of 3.1, StandardTokenizer implements Unicode text segmentation. If you use a previous version number, you get the exact behavior of ClassicTokenizer for backwards compatibility.
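The version-dependent behavior above is selected at construction time. A minimal sketch of creating the tokenizer with an explicit compatibility version (this assumes the Lucene.Net 4.8 API surface; `LuceneVersion.LUCENE_48` is one example value):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Passing LUCENE_48 selects the current Unicode segmentation behavior;
// an older LuceneVersion value would reproduce the legacy (pre-3.1/3.4)
// splitting described above.
TextReader reader = new StringReader("Lucene.NET tokenizes this text.");
using var tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
```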
Assembly: Lucene.Net.Analysis.Common.dll
Syntax

```csharp
[Serializable]
public sealed class StandardTokenizer : Tokenizer, IDisposable
```
Constructors
Name | Description |
---|---|
StandardTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader) | Creates a new StandardTokenizer with a given AttributeSource.AttributeFactory |
StandardTokenizer(LuceneVersion, TextReader) | Creates a new instance of the StandardTokenizer. Attaches the input to a newly created JFlex scanner. |
Fields
Name | Description |
---|---|
ACRONYM | |
ACRONYM_DEP | |
ALPHANUM | |
APOSTROPHE | |
CJ | |
COMPANY | |
HANGUL | |
HIRAGANA | |
HOST | |
IDEOGRAPHIC | |
KATAKANA | |
NUM | |
SOUTHEAST_ASIAN | |
TOKEN_TYPES | String token types that correspond to token type int constants |
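As the TOKEN_TYPES description notes, the int field constants double as indices into that string array. A brief sketch of the lookup, assuming the Lucene.Net 4.8 API:

```csharp
using Lucene.Net.Analysis.Standard;

// Each int constant (e.g. ALPHANUM) indexes its human-readable name
// in the TOKEN_TYPES array; this is the string a TypeAttribute reports.
string alphanumName = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.ALPHANUM];
string numName      = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.NUM];
```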
Properties
Name | Description |
---|---|
MaxTokenLength | Gets or sets the maximum allowed token length. Any token longer than this is skipped. |
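Because over-long tokens are skipped rather than truncated, MaxTokenLength acts as a filter. A hedged sketch, assuming the Lucene.Net 4.8 API (the sample string and length are illustrative):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

using var tokenizer = new StandardTokenizer(
    LuceneVersion.LUCENE_48, new StringReader("internationalization is fine"));

// Any token longer than 10 characters is dropped from the stream entirely,
// so "internationalization" (20 chars) would not be emitted.
tokenizer.MaxTokenLength = 10;
```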
Methods
Name | Description |
---|---|
Dispose(Boolean) | |
End() | |
IncrementToken() | |
Reset() | |
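These methods follow the standard TokenStream consume contract: Reset() before the first IncrementToken() call, End() after the stream is exhausted, then disposal. A sketch of the canonical loop, assuming the Lucene.Net 4.8 API (the attribute interfaces live in `Lucene.Net.Analysis.TokenAttributes`):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

using var tokenizer = new StandardTokenizer(
    LuceneVersion.LUCENE_48, new StringReader("Hello, Lucene.NET 4.8!"));

// Attributes must be obtained before consuming; they are updated in place
// on each IncrementToken() call.
var termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
var typeAtt = tokenizer.AddAttribute<ITypeAttribute>();

tokenizer.Reset();                       // mandatory before the first token
while (tokenizer.IncrementToken())       // advances to the next token
    Console.WriteLine($"{termAtt} [{typeAtt.Type}]");
tokenizer.End();                         // records end-of-stream offset state
```

Calling IncrementToken() without Reset() throws in Lucene.Net's asserting implementations, so the order above is not optional.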