Class JapaneseTokenizer

Tokenizer for Japanese that uses morphological analysis.

Inheritance

System.Object

AttributeSource

TokenStream

Tokenizer

JapaneseTokenizer

Inherited Members

Tokenizer.m_input

Tokenizer.CorrectOffset(Int32)

Tokenizer.SetReader(TextReader)

TokenStream.Dispose()

AttributeSource.GetAttributeFactory()

AttributeSource.GetAttributeClassesEnumerator()

AttributeSource.GetAttributeImplsEnumerator()

AttributeSource.AddAttributeImpl(Attribute)

AttributeSource.AddAttribute<T>()

AttributeSource.HasAttributes

AttributeSource.HasAttribute<T>()

AttributeSource.GetAttribute<T>()

AttributeSource.ClearAttributes()

AttributeSource.CaptureState()

AttributeSource.RestoreState(AttributeSource.State)

AttributeSource.GetHashCode()

AttributeSource.Equals(Object)

AttributeSource.ReflectAsString(Boolean)

AttributeSource.ReflectWith(IAttributeReflector)

AttributeSource.CloneAttributes()

AttributeSource.CopyTo(AttributeSource)

AttributeSource.ToString()

Assembly: Lucene.Net.Analysis.Kuromoji.dll

Syntax

public sealed class JapaneseTokenizer : Tokenizer, IDisposable

Remarks

This tokenizer sets a number of additional attributes:

IBaseFormAttribute containing the base form for inflected adjectives and verbs.
IPartOfSpeechAttribute containing the part of speech.
IReadingAttribute containing the reading and pronunciation.
IInflectionAttribute containing additional part-of-speech information for inflected forms.
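
As a rough illustration of how these attributes might be read, the sketch below tokenizes a string and collects one line per token. It assumes the Lucene.Net.Analysis.Kuromoji package is referenced, that the attribute interfaces live in Lucene.Net.Analysis.Ja.TokenAttributes, and that they expose accessor methods named GetBaseForm(), GetPartOfSpeech(), GetReading(), GetPronunciation(), GetInflectionType() and GetInflectionForm(); none of these names are stated on this page, so check them against the version you use. The class and method names JapaneseTokenizerAttributeDemo and Describe are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.Ja.TokenAttributes; // assumed namespace of the Japanese attributes
using Lucene.Net.Analysis.TokenAttributes;

public static class JapaneseTokenizerAttributeDemo
{
    // Hypothetical helper: returns one descriptive line per token in the input text.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        // No user dictionary, discard punctuation, SEARCH mode.
        using (var tokenizer = new JapaneseTokenizer(reader, null, true, JapaneseTokenizerMode.SEARCH))
        {
            // Attributes must be obtained before the stream is consumed.
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var baseForm = tokenizer.AddAttribute<IBaseFormAttribute>();
            var pos = tokenizer.AddAttribute<IPartOfSpeechAttribute>();
            var reading = tokenizer.AddAttribute<IReadingAttribute>();
            var inflection = tokenizer.AddAttribute<IInflectionAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Accessor names below are assumptions; values may be null
                // (for example, the base form of an uninflected token).
                lines.Add(
                    $"{term} base={baseForm.GetBaseForm()} pos={pos.GetPartOfSpeech()} " +
                    $"reading={reading.GetReading()} pronunciation={reading.GetPronunciation()} " +
                    $"inflection={inflection.GetInflectionType()}/{inflection.GetInflectionForm()}");
            }
            tokenizer.End();
        }
        return lines;
    }
}
```

The Reset(), IncrementToken(), End() and disposal sequence follows the usual TokenStream consumption pattern; the attribute instances are reused across tokens, so their values should be read inside the loop.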

This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compounds (longer than 2 characters if all Kanji, or longer than 7 characters otherwise), the tokenizer checks whether a second-best segmentation of that token exists after penalties are applied to long tokens. If one exists and the mode is SEARCH, the alternate segmentation is output as well.
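
The least-cost idea can be sketched with a toy example. The code below is not the tokenizer's implementation: it is a simplified, non-rolling Viterbi pass over a small hypothetical dictionary with made-up word costs, it ignores connection costs between tokens, and it treats any single unknown character as a fallback token. The names ViterbiSegmentationSketch, Segment, wordCosts and unknownCost are all illustrative.

```csharp
using System;
using System.Collections.Generic;

public static class ViterbiSegmentationSketch
{
    // Returns the least-cost split of text into dictionary words.
    // A single character not in the dictionary is allowed at unknownCost.
    public static List<string> Segment(
        string text, IReadOnlyDictionary<string, int> wordCosts, int unknownCost)
    {
        int n = text.Length;
        var best = new int[n + 1]; // best[i]: lowest total cost of segmenting text[0..i)
        var back = new int[n + 1]; // back[i]: start index of the last token on that path
        for (int i = 1; i <= n; i++) best[i] = int.MaxValue;

        for (int end = 1; end <= n; end++)
        {
            for (int start = 0; start < end; start++)
            {
                if (best[start] == int.MaxValue) continue; // unreachable position
                string piece = text.Substring(start, end - start);

                int cost;
                if (wordCosts.TryGetValue(piece, out int known)) cost = known;
                else if (piece.Length == 1) cost = unknownCost;
                else continue; // not a candidate token

                int total = best[start] + cost;
                if (total < best[end])
                {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Follow the back pointers from the end of the text to recover the path.
        var tokens = new List<string>();
        for (int end = n; end > 0; end = back[end])
        {
            tokens.Add(text.Substring(back[end], end - back[end]));
        }
        tokens.Reverse();
        return tokens;
    }
}
```

With hypothetical costs of 10 for 関西国際空港 and 4 each for 関西, 国際 and 空港, the whole compound is the cheapest path (10 versus 12). Raising the compound's cost to 15, which stands in for a long-token penalty, makes the three-way split the cheapest path instead. That is the kind of alternate segmentation SEARCH mode is described as emitting.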

Constructors

Name	Description
JapaneseTokenizer(AttributeSource.AttributeFactory, TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)	Create a new JapaneseTokenizer.
JapaneseTokenizer(TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)	Create a new JapaneseTokenizer. Uses the default AttributeFactory.

Fields

Name	Description
DEFAULT_MODE	Default tokenization mode. Currently this is SEARCH.

Properties

Name	Description
GraphvizFormatter	Expert: set this to produce Graphviz (DOT) output of the Viterbi lattice.

Methods

Name	Description
Dispose(Boolean)
End()
IncrementToken()
Reset()

Extension Methods

Number.IsNumber(Object)

SystemTypesHelpers.toString(Object)

SystemTypesHelpers.equals(Object, Object)