Class StandardTokenizer
A grammar-based tokenizer constructed with JFlex.
As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
You must specify the required LuceneVersion compatibility when creating StandardTokenizer:
- As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility.
- As of 3.1, StandardTokenizer implements Unicode text segmentation. If you use a previous version number, you get the exact behavior of ClassicTokenizer for backwards compatibility.
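The version-dependent behavior above is selected at construction time. A minimal sketch of creating the tokenizer with an explicit compatibility version (this assumes the Lucene.Net 4.8 API surface; `LuceneVersion.LUCENE_48` is one example value):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Passing LUCENE_48 selects the current Unicode segmentation behavior;
// an older LuceneVersion value would reproduce the legacy (pre-3.1/3.4)
// splitting described above.
TextReader reader = new StringReader("Lucene.NET tokenizes this text.");
using var tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
```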
Assembly: Lucene.Net.Analysis.Common.dll
Syntax

```csharp
[Serializable]
public sealed class StandardTokenizer : Tokenizer, IDisposable
```
Constructors
Name | Description |
---|---|
StandardTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader) | Creates a new StandardTokenizer with a given AttributeSource.AttributeFactory |
StandardTokenizer(LuceneVersion, TextReader) | Creates a new instance of the StandardTokenizer. Attaches the input to a newly created JFlex scanner. |
Fields
Name | Description |
---|---|
ACRONYM | |
ACRONYM_DEP | |
ALPHANUM | |
APOSTROPHE | |
CJ | |
COMPANY | |
HANGUL | |
HIRAGANA | |
HOST | |
IDEOGRAPHIC | |
KATAKANA | |
NUM | |
SOUTHEAST_ASIAN | |
TOKEN_TYPES | String token types that correspond to token type int constants |
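As the TOKEN_TYPES description notes, the int field constants double as indices into that string array. A brief sketch of the lookup, assuming the Lucene.Net 4.8 API:

```csharp
using Lucene.Net.Analysis.Standard;

// Each int constant (e.g. ALPHANUM) indexes its human-readable name
// in the TOKEN_TYPES array; this is the string a TypeAttribute reports.
string alphanumName = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.ALPHANUM];
string numName      = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.NUM];
```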
Properties
Name | Description |
---|---|
MaxTokenLength | Gets or sets the maximum allowed token length. Any token longer than this is skipped. |
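Because over-long tokens are skipped rather than truncated, MaxTokenLength acts as a filter. A hedged sketch, assuming the Lucene.Net 4.8 API (the sample string and length are illustrative):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

using var tokenizer = new StandardTokenizer(
    LuceneVersion.LUCENE_48, new StringReader("internationalization is fine"));

// Any token longer than 10 characters is dropped from the stream entirely,
// so "internationalization" (20 chars) would not be emitted.
tokenizer.MaxTokenLength = 10;
```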
Methods
Name | Description |
---|---|
Dispose(Boolean) | |
End() | |
IncrementToken() | |
Reset() | |
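These methods follow the standard TokenStream consume contract: Reset() before the first IncrementToken() call, End() after the stream is exhausted, then disposal. A sketch of the canonical loop, assuming the Lucene.Net 4.8 API (the attribute interfaces live in `Lucene.Net.Analysis.TokenAttributes`):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

using var tokenizer = new StandardTokenizer(
    LuceneVersion.LUCENE_48, new StringReader("Hello, Lucene.NET 4.8!"));

// Attributes must be obtained before consuming; they are updated in place
// on each IncrementToken() call.
var termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
var typeAtt = tokenizer.AddAttribute<ITypeAttribute>();

tokenizer.Reset();                       // mandatory before the first token
while (tokenizer.IncrementToken())       // advances to the next token
    Console.WriteLine($"{termAtt} [{typeAtt.Type}]");
tokenizer.End();                         // records end-of-stream offset state
```

Calling IncrementToken() without Reset() throws in Lucene.Net's asserting implementations, so the order above is not optional.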