Class UAX29URLEmailTokenizerImpl
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
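The token types listed above are easiest to observe through the public tokenizer that wraps this scanner. The following is a minimal sketch, assuming Lucene.NET 4.8 with the `Lucene.Net.Analysis.Common` package referenced; the `UAX29URLEmailTokenizer` wrapper, the `LuceneVersion.LUCENE_48` constant, and the sample input string are not part of this page and are used here only for illustration.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class TokenTypeDemo
{
    public static void Main()
    {
        // Sample input (made up for this sketch): a word, an email address, a URL, a number.
        var reader = new StringReader("Contact admin@example.com or visit http://example.com/docs 42");

        // Assumption: UAX29URLEmailTokenizer is the public tokenizer built on this scanner class.
        using (var tokenizer = new UAX29URLEmailTokenizer(LuceneVersion.LUCENE_48, reader))
        {
            // Attributes expose the current token's text and its type string.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                   // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())   // advances to the next token, false at end of input
            {
                Console.WriteLine($"{type.Type}\t{term}");
            }
            tokenizer.End();                     // finalizes end-of-stream state
        }
    }
}
```

Each printed line pairs a type string such as `<EMAIL>` or `<URL>` with the token text, which makes it straightforward to check how a given input is classified.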
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public sealed class UAX29URLEmailTokenizerImpl : IStandardTokenizerInterface
Constructors
Name | Description |
---|---|
UAX29URLEmailTokenizerImpl(TextReader) | Creates a new scanner that reads from the given TextReader. |
Fields
Name | Description |
---|---|
AVOID_BAD_URL | |
EMAIL_TYPE | |
HANGUL_TYPE | |
HIRAGANA_TYPE | |
IDEOGRAPHIC_TYPE | |
KATAKANA_TYPE | |
NUMERIC_TYPE | Numbers |
SOUTH_EAST_ASIAN_TYPE | Characters in the class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See the Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA |
URL_TYPE | |
WORD_TYPE | Alphanumeric sequences |
YYEOF | This value denotes the end of file. |
YYINITIAL | The initial lexical state. |
Properties
Name | Description |
---|---|
YyChar | |
YyLength | Returns the length of the matched text region. |
YyState | Returns the current lexical state. |
YyText | Returns the text matched by the current regular expression. |
Methods
Name | Description |
---|---|
GetNextToken() | Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs. |
GetText(ICharTermAttribute) | Fills ICharTermAttribute with the current token text. |
YyBegin(Int32) | Enters a new lexical state. |
YyCharAt(Int32) | Returns the character at position pos from the matched text. It is equivalent to YyText[pos], but faster. |
YyClose() | Disposes the input stream. |
YyPushBack(Int32) | Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method. |
YyReset(TextReader) | Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL. The internal scan buffer is resized down to its initial length if it has grown. |