Class UAX29URLEmailTokenizer
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
You must specify the required LuceneVersion compatibility when creating UAX29URLEmailTokenizer:
- As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you specify an earlier version, the tokenizer reproduces the old, incorrect splitting for backwards compatibility.
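The token types listed above can be inspected at run time through the token stream's type attribute. The following is a minimal sketch rather than an official sample: it assumes the Lucene.NET 4.8 packages (`Lucene.Net` and `Lucene.Net.Analysis.Common`) are referenced and that `LuceneVersion.LUCENE_48` is the desired compatibility version. `TokenTypeDemo` and `Tokenize` are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class TokenTypeDemo
{
    // Returns each token's text together with its type string, e.g. "<EMAIL>".
    public static List<(string Term, string Type)> Tokenize(string text)
    {
        var tokens = new List<(string Term, string Type)>();

        // Assumption: LUCENE_48 is the compatibility version you want.
        using (var tokenizer = new UAX29URLEmailTokenizer(
                   LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                    // required before the first IncrementToken()
            while (tokenizer.IncrementToken())
            {
                tokens.Add((term.ToString(), type.Type));
            }
            tokenizer.End();                      // finalizes end-of-stream state
        }

        return tokens;
    }

    public static void Main()
    {
        string text = "Mail admin@example.com or see http://example.com/docs";
        foreach (var (term, type) in Tokenize(text))
        {
            Console.WriteLine($"{type}\t{term}");
        }
    }
}
```

The `Reset()`, `IncrementToken()`, `End()`, and dispose sequence follows the usual token stream consumption contract; the `using` block takes care of disposal.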
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public sealed class UAX29URLEmailTokenizer : Tokenizer, IDisposable
Constructors
Name | Description |
---|---|
UAX29URLEmailTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader) | Creates a new UAX29URLEmailTokenizer with a given AttributeSource.AttributeFactory |
UAX29URLEmailTokenizer(LuceneVersion, TextReader) | Creates a new instance of the UAX29URLEmailTokenizer. Attaches the input to the newly created JFlex scanner. |
Fields
Name | Description |
---|---|
ALPHANUM | |
HANGUL | |
HIRAGANA | |
IDEOGRAPHIC | |
KATAKANA | |
NUM | |
SOUTHEAST_ASIAN | |
TOKEN_TYPES | String token types that correspond to token type int constants |
URL | |
Properties
Name | Description |
---|---|
MaxTokenLength | Gets or sets the maximum allowed token length. Any token longer than this is skipped. |
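To illustrate the skipping behavior of MaxTokenLength, here is a small sketch that sets the property before consuming the stream. It assumes the Lucene.NET 4.8 packages are referenced and uses `LuceneVersion.LUCENE_48` as an example version. `MaxTokenLengthDemo` and `TokenizeWithLimit` are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class MaxTokenLengthDemo
{
    // Tokenizes text and returns only the terms that fit within maxLength.
    public static List<string> TokenizeWithLimit(string text, int maxLength)
    {
        var terms = new List<string>();

        using (var tokenizer = new UAX29URLEmailTokenizer(
                   LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // Tokens longer than this are skipped, not truncated.
            tokenizer.MaxTokenLength = maxLength;

            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                terms.Add(term.ToString());
            }
            tokenizer.End();
        }

        return terms;
    }

    public static void Main()
    {
        // "extraordinarily" exceeds the 5-character limit and is dropped.
        foreach (string term in TokenizeWithLimit("short extraordinarily ok", 5))
        {
            Console.WriteLine(term);
        }
    }
}
```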
Methods
Name | Description |
---|---|
Dispose(Boolean) | |
End() | |
IncrementToken() | |
Reset() | |