Class WikipediaTokenizer
Extension of StandardTokenizer that is aware of Wikipedia syntax. It is based on the Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not be complete.
This API is experimental and might change in incompatible ways in the next release.
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public sealed class WikipediaTokenizer : Tokenizer, IDisposable
Constructors
Name | Description |
---|---|
WikipediaTokenizer(AttributeSource.AttributeFactory, TextReader, Int32, ICollection<String>) | Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner. |
WikipediaTokenizer(TextReader) | Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner. |
WikipediaTokenizer(TextReader, Int32, ICollection<String>) | Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner. |
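The constructor overloads above can be sketched as follows. This is a minimal, hedged example: the sample markup and the class name `WikipediaTokenizerConstruction` are hypothetical, and the `Int32` argument is assumed to be one of the output-mode fields of this class (`TOKENS_ONLY`, `UNTOKENIZED_ONLY`, `BOTH`), with the collection naming the token types to keep untokenized.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Wikipedia;

public static class WikipediaTokenizerConstruction
{
    public static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        const string wikiText =
            "'''Lucene''' is a search library. [[Category:Search engines]]";

        // Simplest overload: only the reader is supplied.
        using (var simple = new WikipediaTokenizer(new StringReader(wikiText)))
        {
            // Consume the stream here (Reset, IncrementToken, End).
        }

        // Overload with an output mode and a set of untokenized types.
        // Assumption: category tokens should also be emitted as one unsplit token.
        var untokenizedTypes = new HashSet<string> { WikipediaTokenizer.CATEGORY };
        using (var configured = new WikipediaTokenizer(
            new StringReader(wikiText), WikipediaTokenizer.BOTH, untokenizedTypes))
        {
            // Consume the stream here (Reset, IncrementToken, End).
        }
    }
}
```

A `HashSet<string>` is used for the type collection only because it satisfies `ICollection<String>`; any other collection of type names should work the same way.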
Fields
Name | Description |
---|---|
ACRONYM_ID | |
ALPHANUM_ID | |
APOSTROPHE_ID | |
BOLD | |
BOLD_ID | |
BOLD_ITALICS | |
BOLD_ITALICS_ID | |
BOTH | Output both the untokenized token and the splits |
CATEGORY | |
CATEGORY_ID | |
CITATION | |
CITATION_ID | |
CJ_ID | |
COMPANY_ID | |
EMAIL_ID | |
EXTERNAL_LINK | |
EXTERNAL_LINK_ID | |
EXTERNAL_LINK_URL | |
EXTERNAL_LINK_URL_ID | |
HEADING | |
HEADING_ID | |
HOST_ID | |
INTERNAL_LINK | |
INTERNAL_LINK_ID | |
ITALICS | |
ITALICS_ID | |
NUM_ID | |
SUB_HEADING | |
SUB_HEADING_ID | |
TOKEN_TYPES | String token types that correspond to token type int constants |
TOKENS_ONLY | Only output tokens |
UNTOKENIZED_ONLY | Only output untokenized tokens, which are tokens that would normally be split into several tokens |
UNTOKENIZED_TOKEN_FLAG | Flag indicating that the produced token would have been split into multiple tokens if TOKENS_ONLY were used. |
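As a rough illustration of how the output-mode fields and `UNTOKENIZED_TOKEN_FLAG` fit together, the sketch below runs the tokenizer in `BOTH` mode and reports, for each token, its type and whether the flag is set. The sample text and the class name `WikipediaTokenizerFlags` are hypothetical. The attribute interfaces (`ICharTermAttribute`, `ITypeAttribute`, `IFlagsAttribute`) come from `Lucene.Net.Analysis.TokenAttributes` and are not described on this page, so treat their use here as an assumption about the surrounding library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

public static class WikipediaTokenizerFlags
{
    public static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        const string wikiText = "Some text. [[Category:Search engines]]";

        // Ask for category tokens both as one unsplit token and as their splits.
        var untokenizedTypes = new HashSet<string> { WikipediaTokenizer.CATEGORY };

        using (var tokenizer = new WikipediaTokenizer(
            new StringReader(wikiText), WikipediaTokenizer.BOTH, untokenizedTypes))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();
            var flags = tokenizer.AddAttribute<IFlagsAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // The flag marks a token that TOKENS_ONLY would have split.
                bool untokenized =
                    (flags.Flags & WikipediaTokenizer.UNTOKENIZED_TOKEN_FLAG) != 0;
                Console.WriteLine($"{term} type={type.Type} untokenized={untokenized}");
            }
            tokenizer.End();
        }
    }
}
```

Switching the mode to `TOKENS_ONLY` or `UNTOKENIZED_ONLY` in the constructor call is the only change needed to compare the three behaviours on the same input.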
Methods
Name | Description |
---|---|
Dispose(Boolean) | |
End() | |
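IncrementToken() | Advances the tokenizer to the next token. Returns false once the end of the input is reached. |
Reset() | Resets the tokenizer so that the current input can be consumed from the beginning. |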
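The methods above are normally called in a fixed order by the consumer. The following sketch shows that order under the usual Lucene.NET `TokenStream` conventions, which this page does not spell out: `Reset()` first, `IncrementToken()` until it returns false, then `End()`, and finally disposal. The input text and the class name `WikipediaTokenizerLifecycle` are hypothetical, and `IOffsetAttribute` is assumed from `Lucene.Net.Analysis.TokenAttributes`.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

public static class WikipediaTokenizerLifecycle
{
    public static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        const string wikiText = "== Heading == ''italic text'' and plain text";

        var tokenizer = new WikipediaTokenizer(new StringReader(wikiText));
        try
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var offset = tokenizer.AddAttribute<IOffsetAttribute>();

            // 1. Reset must be called before the first IncrementToken.
            tokenizer.Reset();

            // 2. Pull tokens until the tokenizer reports the end of input.
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine($"{term} [{offset.StartOffset},{offset.EndOffset})");
            }

            // 3. End performs end-of-stream work such as setting the final offset.
            tokenizer.End();
            Console.WriteLine($"final offset: {offset.EndOffset}");
        }
        finally
        {
            // 4. Dispose releases the underlying reader.
            tokenizer.Dispose();
        }
    }
}
```

The explicit `try`/`finally` is used here only to make the disposal step visible; a `using` block would be the more idiomatic form.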