Class SegmentingTokenizerBase

Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.

@lucene.experimental
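
The following is a minimal sketch, not taken from the Lucene.NET sources, of how a subclass might implement the two hooks listed under Methods, SetNextSentence(Int32, Int32) and IncrementWord(). The class name WhitespaceSentenceTokenizer is hypothetical. The sketch assumes that both hooks are protected abstract members, that the base class lives in the Lucene.Net.Analysis.Util namespace, that BreakIterator comes from ICU4N.Text, and that the sentence bounds passed to SetNextSentence are indexes into m_buffer.

```csharp
using System.IO;
using ICU4N.Text;                          // assumed home of BreakIterator
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;            // assumed namespace of SegmentingTokenizerBase

// Hypothetical subclass: splits each sentence on whitespace.
public sealed class WhitespaceSentenceTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;

    private int pos;          // current position inside m_buffer
    private int sentenceEnd;  // exclusive end of the current sentence

    public WhitespaceSentenceTokenizer(TextReader reader, BreakIterator sentenceIterator)
        : base(reader, sentenceIterator)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Called by the base class once per sentence found by the BreakIterator.
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        this.sentenceEnd = sentenceEnd;
    }

    // Emits the next word of the current sentence, or returns false when
    // the sentence is exhausted so the base class can move to the next one.
    protected override bool IncrementWord()
    {
        // Skip leading whitespace.
        while (pos < sentenceEnd && char.IsWhiteSpace(m_buffer[pos]))
        {
            pos++;
        }
        if (pos >= sentenceEnd)
        {
            return false;
        }

        // Consume one run of non-whitespace characters.
        int start = pos;
        while (pos < sentenceEnd && !char.IsWhiteSpace(m_buffer[pos]))
        {
            pos++;
        }

        ClearAttributes();
        termAtt.CopyBuffer(m_buffer, start, pos - start);
        // m_offset is the accumulated offset of the previous buffers.
        offsetAtt.SetOffset(CorrectOffset(m_offset + start), CorrectOffset(m_offset + pos));
        return true;
    }
}
```

Splitting on whitespace is only a stand-in for real word segmentation. A CJK or Thai segmenter would replace the two loops with dictionary or model lookups over the same sentence range.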

Inheritance

System.Object

AttributeSource

TokenStream

Tokenizer

SegmentingTokenizerBase

HMMChineseTokenizer

ThaiTokenizer

Inherited Members

Tokenizer.m_input

Tokenizer.Dispose(Boolean)

Tokenizer.CorrectOffset(Int32)

Tokenizer.SetReader(TextReader)

TokenStream.Dispose()

AttributeSource.GetAttributeFactory()

AttributeSource.GetAttributeClassesEnumerator()

AttributeSource.GetAttributeImplsEnumerator()

AttributeSource.AddAttributeImpl(Attribute)

AttributeSource.AddAttribute<T>()

AttributeSource.HasAttributes

AttributeSource.HasAttribute<T>()

AttributeSource.GetAttribute<T>()

AttributeSource.ClearAttributes()

AttributeSource.CaptureState()

AttributeSource.RestoreState(AttributeSource.State)

AttributeSource.GetHashCode()

AttributeSource.Equals(Object)

AttributeSource.ReflectAsString(Boolean)

AttributeSource.ReflectWith(IAttributeReflector)

AttributeSource.CloneAttributes()

AttributeSource.CopyTo(AttributeSource)

AttributeSource.ToString()

Assembly: Lucene.Net.ICU.dll

Syntax

public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable

Constructors

Name	Description
SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator)	Constructs a new SegmentingTokenizerBase, also supplying the AttributeSource.AttributeFactory.
SegmentingTokenizerBase(TextReader, BreakIterator)	Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation. Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned instance to this constructor.

Fields

Name	Description
BUFFERMAX
m_buffer
m_offset	Accumulated offset of the previous buffers for this reader, used for offsetAtt.

Methods

Name	Description
End()
IncrementToken()
IncrementWord()	Returns true if another word is available.
IsSafeEnd(Char)	For sentence tokenization, these are the unambiguous break positions.
Reset()
SetNextSentence(Int32, Int32)	Provides the next input sentence for analysis.

Extension Methods

Number.IsNumber(Object)

SystemTypesHelpers.toString(Object)

SystemTypesHelpers.equals(Object, Object)