Class ClassicTokenizer
A grammar-based tokenizer constructed with JFlex (and then ported to .NET).
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.
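As a rough illustration of how the rules above play out, the sketch below runs a ClassicTokenizer over a short string containing a hyphenated "product number", a hostname, and an email address. This is a minimal sketch, assuming Lucene.NET 4.8 conventions (LuceneVersion.LUCENE_48, the Lucene.Net.Analysis.Standard and Lucene.Net.Analysis.TokenAttributes namespaces, and the ICharTermAttribute/ITypeAttribute attributes); it is not taken from this page.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Sample text: a hyphenated "product number", a hostname, and an email address.
var text = "Order XT-3000 from shop.example.com or mail sales@example.com today.";

using (var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader(text)))
{
    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    var type = tokenizer.AddAttribute<ITypeAttribute>();

    tokenizer.Reset();
    while (tokenizer.IncrementToken())
    {
        // Per the rules above, "XT-3000", "shop.example.com", and "sales@example.com"
        // should each come through as a single token, with its token type alongside.
        Console.WriteLine($"{term} [{type.Type}]");
    }
    tokenizer.End();
}
```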
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public sealed class ClassicTokenizer : Tokenizer, IDisposable
Constructors
Name | Description |
---|---|
ClassicTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader) | Creates a new ClassicTokenizer with a given AttributeSource.AttributeFactory |
ClassicTokenizer(LuceneVersion, TextReader) | Creates a new instance of the ClassicTokenizer. Attaches the input to a newly created JFlex scanner. |
Fields
Name | Description |
---|---|
ACRONYM | |
ACRONYM_DEP | |
ALPHANUM | |
APOSTROPHE | |
CJ | |
COMPANY | |
HOST | |
NUM | |
TOKEN_TYPES | String token types that correspond to token type int constants |
Properties
Name | Description |
---|---|
MaxTokenLength | Gets or sets the maximum allowed token length. Any token longer than this is skipped. |
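As a small, hedged sketch (the 255-character cap is only an illustrative value, and LuceneVersion.LUCENE_48 is assumed rather than taken from this page), a consumer could set the property right after construction:

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Illustrative only: any token longer than 255 characters will be skipped.
var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader("some input text"))
{
    MaxTokenLength = 255
};
```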
Methods
Name | Description |
---|---|
Dispose(Boolean) | |
End() | |
IncrementToken() | |
Reset() | |
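The table above leaves the descriptions empty; the sketch below shows one plausible call order for these members under the usual Lucene.NET TokenStream contract (Reset before the first IncrementToken, End after the last, Dispose via using). It is an assumption-based illustration, not text taken from this page.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

using (var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader("foo bar baz")))
{
    tokenizer.Reset();                  // prepare the stream before the first IncrementToken()
    while (tokenizer.IncrementToken())  // advance to the next token; returns false once input is exhausted
    {
        // read token attributes here (e.g. ICharTermAttribute)
    }
    tokenizer.End();                    // mark end-of-stream so final state (e.g. offsets) is set
}                                       // the using block calls Dispose(), which routes to Dispose(Boolean)
```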