Class UAX29URLEmailTokenizer

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<URL>: A URL
<EMAIL>: An email address
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character

You must specify the required LuceneVersion compatibility when creating UAX29URLEmailTokenizer:

As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you pass an earlier version, you get the previous (broken) behavior exactly, for backwards compatibility.
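A minimal usage sketch follows, showing how the tokenizer might be constructed with a LuceneVersion and how the token text and type can be read. It assumes Lucene.NET 4.8, where LuceneVersion.LUCENE_48 is available and the class lives in the Lucene.Net.Analysis.Standard namespace. The sample input string is purely illustrative.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Illustrative input containing plain words, an email address and a URL.
const string text = "Contact admin@example.com or visit http://example.com/docs";

using var reader = new StringReader(text);

// Assumption: LuceneVersion.LUCENE_48 is the compatibility version you want.
using var tokenizer = new UAX29URLEmailTokenizer(LuceneVersion.LUCENE_48, reader);

// Attributes expose the text and the type of the current token.
ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

// Reset() must be called before the first IncrementToken().
tokenizer.Reset();
while (tokenizer.IncrementToken())
{
    // type.Type holds one of the type strings listed above, e.g. <URL> or <EMAIL>.
    Console.WriteLine($"{term} {type.Type}");
}

// End() performs end-of-stream operations such as setting the final offset.
tokenizer.End();
```

The Reset, IncrementToken, End sequence is the usual TokenStream consumption pattern; the `using` declarations dispose the tokenizer and the reader when they go out of scope.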

Inheritance

System.Object

AttributeSource

TokenStream

Tokenizer

UAX29URLEmailTokenizer

Inherited Members

Tokenizer.m_input

Tokenizer.CorrectOffset(Int32)

Lucene.Net.Analysis.Tokenizer.SetReader(System.IO.TextReader)

TokenStream.Dispose()

AttributeSource.GetAttributeFactory()

AttributeSource.GetAttributeClassesEnumerator()

AttributeSource.GetAttributeImplsEnumerator()

AttributeSource.AddAttributeImpl(Attribute)

AttributeSource.AddAttribute<T>()

AttributeSource.HasAttributes

AttributeSource.HasAttribute<T>()

AttributeSource.GetAttribute<T>()

AttributeSource.ClearAttributes()

AttributeSource.CaptureState()

AttributeSource.RestoreState(AttributeSource.State)

AttributeSource.GetHashCode()

AttributeSource.Equals(Object)

AttributeSource.ReflectAsString(Boolean)

AttributeSource.ReflectWith(IAttributeReflector)

AttributeSource.CloneAttributes()

AttributeSource.CopyTo(AttributeSource)

AttributeSource.ToString()

System.Object.Equals(System.Object, System.Object)

System.Object.ReferenceEquals(System.Object, System.Object)

System.Object.GetType()

System.Object.MemberwiseClone()

Assembly: Lucene.Net.Analysis.Common.dll

Syntax

[Serializable]
public sealed class UAX29URLEmailTokenizer : Tokenizer, IDisposable

Constructors

Name	Description
UAX29URLEmailTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader)	Creates a new UAX29URLEmailTokenizer with a given AttributeSource.AttributeFactory
UAX29URLEmailTokenizer(LuceneVersion, TextReader)	Creates a new instance of the UAX29URLEmailTokenizer. Attaches the `input` to the newly created JFlex scanner.

Fields

Name	Description
ALPHANUM
EMAIL
HANGUL
HIRAGANA
IDEOGRAPHIC
KATAKANA
NUM
SOUTHEAST_ASIAN
TOKEN_TYPES	String token types that correspond to token type int constants
URL

Properties

Name	Description
MaxTokenLength	Gets or sets the maximum allowed token length. Any token longer than this is skipped.
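As a hedged illustration of this property, the sketch below lowers the limit before consuming the stream. It assumes Lucene.NET 4.8; the limit of 10 and the input string are arbitrary values chosen for the example.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Illustrative input: the second word is longer than the limit set below.
using var reader = new StringReader("short extraordinarilylongtoken tail");
using var tokenizer = new UAX29URLEmailTokenizer(LuceneVersion.LUCENE_48, reader);

// Arbitrary limit for this example; set it before consuming tokens.
tokenizer.MaxTokenLength = 10;

ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

tokenizer.Reset();
while (tokenizer.IncrementToken())
{
    // Per the description above, tokens longer than MaxTokenLength are skipped.
    Console.WriteLine(term.ToString());
}
tokenizer.End();
```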

Methods

Name	Description
Dispose(Boolean)
End()
IncrementToken()
Reset()

Extension Methods

Number.IsNumber(Object)

SystemTypesHelpers.toString(Object)

SystemTypesHelpers.equals(Object, Object)