Class JapaneseTokenizer
Tokenizer for Japanese that uses morphological analysis.
Assembly: Lucene.Net.Analysis.Kuromoji.dll
Syntax
public sealed class JapaneseTokenizer : Tokenizer, IDisposable
Remarks
This tokenizer sets a number of additional attributes:
- IBaseFormAttribute containing base form for inflected adjectives and verbs.
- IPartOfSpeechAttribute containing part-of-speech.
- IReadingAttribute containing reading and pronunciation.
- IInflectionAttribute containing additional part-of-speech information for inflected forms.
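
As a rough illustration of how the attributes listed above might be read, the sketch below walks a token stream and collects one summary line per token. It assumes the Lucene.NET 4.8 layout, where the attribute interfaces live in `Lucene.Net.Analysis.Ja.TokenAttributes` and expose getter methods such as `GetBaseForm()`. It also assumes the `Boolean` constructor argument controls whether punctuation is discarded, as in the Java original. `DescribeTokens` is a hypothetical helper name, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.Ja.TokenAttributes;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper: returns one descriptive line per token.
List<string> DescribeTokens(string text)
{
    var lines = new List<string>();

    // Arguments: input reader, no user dictionary, discard punctuation
    // (assumed meaning of the Boolean parameter), tokenization mode.
    using var tokenizer = new JapaneseTokenizer(
        new StringReader(text), null, true, JapaneseTokenizerMode.SEARCH);

    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    var baseForm = tokenizer.AddAttribute<IBaseFormAttribute>();
    var pos = tokenizer.AddAttribute<IPartOfSpeechAttribute>();
    var reading = tokenizer.AddAttribute<IReadingAttribute>();
    var inflection = tokenizer.AddAttribute<IInflectionAttribute>();

    tokenizer.Reset();
    while (tokenizer.IncrementToken())
    {
        // Getters may return null (for example, no base form for a
        // non-inflected word); string interpolation renders null as empty.
        lines.Add($"{term} | base={baseForm.GetBaseForm()} | pos={pos.GetPartOfSpeech()} " +
                  $"| reading={reading.GetReading()} | form={inflection.GetInflectionForm()}");
    }
    tokenizer.End();
    return lines;
}

foreach (var line in DescribeTokens("美しい花が咲いた"))
    Console.WriteLine(line);
```

The exact values printed depend on the bundled dictionary, so no expected output is shown here.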
This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. Tokens that appear to be compounds (longer than 2 characters if all Kanji, or longer than 7 characters otherwise) are checked for a second-best segmentation after penalties are applied to the long tokens. If one exists and the mode is SEARCH, the alternate segmentation is output as well.
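
The idea of a least-cost segmentation, and of penalizing long tokens to surface an alternate path, can be sketched with a toy dynamic program. This is not the tokenizer's implementation: the lexicon, the costs, the `Segment` function, and its `penaltyLength` and `longTokenPenalty` parameters are all hypothetical, and the real search also adds connection costs between adjacent tokens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy lexicon: surface form -> word cost (lower is better).
var lexicon = new Dictionary<string, int>
{
    ["関西"] = 3, ["国際"] = 3, ["空港"] = 3, ["関西国際空港"] = 7,
};

// Returns the least-cost segmentation of text using only lexicon words,
// or an empty list if the text cannot be covered. Words longer than
// penaltyLength have longTokenPenalty added to their cost.
List<string> Segment(string text, IReadOnlyDictionary<string, int> lex,
                     int penaltyLength = int.MaxValue, int longTokenPenalty = 0)
{
    const int Unreachable = int.MaxValue;
    var best = new int[text.Length + 1];  // best[i]: least cost to reach position i
    var back = new int[text.Length + 1];  // back[i]: start of the token ending at i
    Array.Fill(best, Unreachable);
    best[0] = 0;

    for (int end = 1; end <= text.Length; end++)
    {
        for (int start = 0; start < end; start++)
        {
            if (best[start] == Unreachable) continue;
            string word = text.Substring(start, end - start);
            if (!lex.TryGetValue(word, out int cost)) continue;
            if (word.Length > penaltyLength) cost += longTokenPenalty;
            if (best[start] + cost < best[end])
            {
                best[end] = best[start] + cost;
                back[end] = start;
            }
        }
    }

    var tokens = new List<string>();
    if (best[text.Length] == Unreachable) return tokens;

    // Walk the back pointers from the end, then restore left-to-right order.
    for (int end = text.Length; end > 0; end = back[end])
        tokens.Add(text.Substring(back[end], end - back[end]));
    tokens.Reverse();
    return tokens;
}

var plain = Segment("関西国際空港", lexicon);
var penalized = Segment("関西国際空港", lexicon, penaltyLength: 2, longTokenPenalty: 10);
Console.WriteLine(string.Join(" | ", plain));      // prints: 関西国際空港
Console.WriteLine(string.Join(" | ", penalized));  // prints: 関西 | 国際 | 空港
```

Without the penalty the single long entry wins (cost 7 against 9). With the penalty it costs 17, so the three-part split becomes the best path. This mirrors the kind of alternate segmentation that SEARCH mode emits alongside the original token.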
Constructors
Name | Description |
---|---|
JapaneseTokenizer(AttributeSource.AttributeFactory, TextReader, UserDictionary, Boolean, JapaneseTokenizerMode) | Create a new JapaneseTokenizer. |
JapaneseTokenizer(TextReader, UserDictionary, Boolean, JapaneseTokenizerMode) | Create a new JapaneseTokenizer. Uses the default AttributeFactory. |
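
A minimal sketch of the four-argument constructor, comparing two modes on the same input. It assumes that `JapaneseTokenizerMode` exposes `NORMAL` and `SEARCH` members, that passing `null` means no user dictionary, and that the `Boolean` argument means "discard punctuation". `Tokenize` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper: collects the surface form of every emitted token.
List<string> Tokenize(string text, JapaneseTokenizerMode mode)
{
    var terms = new List<string>();
    using var tokenizer = new JapaneseTokenizer(
        new StringReader(text), null, true, mode);
    var term = tokenizer.AddAttribute<ICharTermAttribute>();

    tokenizer.Reset();
    while (tokenizer.IncrementToken())
        terms.Add(term.ToString());
    tokenizer.End();
    return terms;
}

foreach (var m in new[] { JapaneseTokenizerMode.NORMAL, JapaneseTokenizerMode.SEARCH })
    Console.WriteLine($"{m}: {string.Join(" | ", Tokenize("関西国際空港", m))}");
```

For a compound such as this one, SEARCH mode would be expected to emit the decompounded parts in addition to the full token, though the exact output depends on the dictionary.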
Fields
Name | Description |
---|---|
DEFAULT_MODE | Default tokenization mode. Currently this is SEARCH. |
Properties
Name | Description |
---|---|
GraphvizFormatter | Expert: set this to produce Graphviz (dot) output of the Viterbi lattice. |
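
One possible way to capture the lattice is sketched below. It assumes that the .NET port mirrors the Java API: a `GraphvizFormatter` constructed from `ConnectionCosts.Instance` (in `Lucene.Net.Analysis.Ja.Dict`) with a `Finish()` method that returns the accumulated dot text. Treat those names as assumptions to check against the installed version. `DumpLattice` is a hypothetical helper.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.Ja.Dict;

// Hypothetical helper: tokenizes text and returns the lattice as dot source.
string DumpLattice(string text)
{
    // Assumption: constructor and Instance property match the Java original.
    var formatter = new GraphvizFormatter(ConnectionCosts.Instance);

    using var tokenizer = new JapaneseTokenizer(
        new StringReader(text), null, true, JapaneseTokenizerMode.SEARCH);
    tokenizer.GraphvizFormatter = formatter;

    tokenizer.Reset();
    while (tokenizer.IncrementToken()) { }  // consume the stream so the lattice is filled in
    tokenizer.End();

    return formatter.Finish();
}

Console.WriteLine(DumpLattice("関西国際空港"));
```

The returned text can be saved to a `.dot` file and rendered with the Graphviz `dot` tool.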
Methods
Name | Description |
---|---|
Dispose(Boolean) | |
End() | |
IncrementToken() | |
Reset() | |