Class SegmentingTokenizerBase
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
This API is experimental and might change in incompatible ways in the next release.
Assembly: Lucene.Net.ICU.dll
Syntax
public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable
Constructors
Name | Description |
---|---|
SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator) | Constructs a new SegmentingTokenizerBase, also supplying the AttributeSource.AttributeFactory. |
SegmentingTokenizerBase(TextReader, BreakIterator) | Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation. Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned one to this constructor. |
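The rule about not sharing a BreakIterator can be followed by keeping one prototype and handing each tokenizer its own clone. The following is a minimal sketch, not the library's prescribed approach. It assumes that `BreakIterator` comes from the `ICU4N.Text` namespace and exposes `GetSentenceInstance(CultureInfo)` and `Clone()`; the helper class `SentenceBreakIterators` is hypothetical.

```csharp
using System.Globalization;
using ICU4N.Text; // assumption: BreakIterator is provided by ICU4N

// Hypothetical helper: keeps a single prototype and returns a private
// clone for every tokenizer, so no two TokenStreams share iterator state.
public static class SentenceBreakIterators
{
    // Created once; never passed to a tokenizer directly.
    private static readonly BreakIterator Prototype =
        BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture);

    // Each call returns an independent copy that is safe to pass to a
    // SegmentingTokenizerBase constructor.
    public static BreakIterator NewInstance()
    {
        return (BreakIterator)Prototype.Clone();
    }
}
```

A subclass constructor could then call `base(reader, SentenceBreakIterators.NewInstance())` so that every instance owns its iterator.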
Fields
Name | Description |
---|---|
BUFFERMAX | |
m_buffer | |
m_offset | Accumulated offset of previous buffers for this reader, used when computing offsets for offsetAtt. |
Methods
Name | Description |
---|---|
End() | |
IncrementToken() | |
IncrementWord() | Returns true if another word is available. |
IsSafeEnd(Char) | For sentence tokenization, these are the unambiguous break positions. |
Reset() | |
SetNextSentence(Int32, Int32) | Provides the next input sentence for analysis. |
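To illustrate how the two subclass hooks fit together, here is a minimal sketch of a subclass that emits each sentence as a single token. It is not a definitive implementation. The class name `WholeSentenceTokenizer` and the `Demo` class are hypothetical. The namespaces (`Lucene.Net.Analysis.Util`, `Lucene.Net.Analysis.TokenAttributes`, `ICU4N.Text`) and the `BreakIterator.GetSentenceInstance(CultureInfo)` call are assumptions, as is the reading that the sentence positions passed to `SetNextSentence` index into `m_buffer`.

```csharp
using System;
using System.Globalization;
using System.IO;
using ICU4N.Text;                          // assumption: source of BreakIterator
using Lucene.Net.Analysis.TokenAttributes; // assumption: attribute interfaces
using Lucene.Net.Analysis.Util;            // assumption: namespace of the base class

// Hypothetical subclass: treats every sentence as one token.
public sealed class WholeSentenceTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private int sentenceStart;
    private int sentenceEnd;
    private bool hasSentence;

    public WholeSentenceTokenizer(TextReader reader)
        // A freshly created iterator per tokenizer; never a shared one.
        : base(reader, BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture))
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Called by the base class with the bounds of the next sentence
    // (assumed here to be positions within m_buffer).
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;
    }

    // Called repeatedly by the base class; returns true while the
    // current sentence still has a token to emit (here: exactly one).
    protected override bool IncrementWord()
    {
        if (!hasSentence)
        {
            return false;
        }
        hasSentence = false;
        ClearAttributes();
        termAtt.CopyBuffer(m_buffer, sentenceStart, sentenceEnd - sentenceStart);
        // m_offset accounts for buffers consumed earlier from the same reader.
        offsetAtt.SetOffset(CorrectOffset(m_offset + sentenceStart),
                            CorrectOffset(m_offset + sentenceEnd));
        return true;
    }
}

public static class Demo
{
    public static void Main()
    {
        using (var tokenizer = new WholeSentenceTokenizer(
            new StringReader("The quick fox ran. It jumped over the dog.")))
        {
            var term = tokenizer.GetAttribute<ICharTermAttribute>();
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            tokenizer.End();
        }
    }
}
```

A real segmenter would split the sentence further inside `IncrementWord()`, returning one word per call. Emitting the whole sentence keeps the sketch short while still showing the calling sequence.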