Class CJKBigramFilter
Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
CJK types are set by these tokenizers, but you can also use CJKBigramFilter(TokenStream, CJKScript) to explicitly control which of the CJK scripts are turned into bigrams.
By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams flag in CJKBigramFilter(TokenStream, CJKScript, Boolean).
This can be used for a combined unigram+bigram approach.
In all cases, all non-CJK input is passed through unmodified.
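As a rough illustration of the behavior described above, the following sketch wraps a StandardTokenizer in a CJKBigramFilter and prints each emitted term with its token type. It assumes the Lucene.Net 4.8 packages (Lucene.Net and Lucene.Net.Analysis.Common) are referenced. It also assumes that `LuceneVersion.LUCENE_48` is available and that the CJKScript enum exposes members named `HAN` and `HIRAGANA` that can be combined as flags. The class name `CjkBigramDemo` and the sample text are made up for this example.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical demo class; not part of Lucene.NET.
public static class CjkBigramDemo
{
    public static void Main()
    {
        // Sample input mixing CJK and non-CJK text (illustrative only).
        TextReader reader = new StringReader("多くの学生 and some English");

        // StandardTokenizer sets the CJK token types that the filter relies on.
        Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);

        // Assumption: CJKScript is a flags enum with HAN and HIRAGANA members.
        // The third argument asks for unigrams in addition to bigrams.
        using (TokenStream stream =
            new CJKBigramFilter(tokenizer, CJKScript.HAN | CJKScript.HIRAGANA, true))
        {
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = stream.AddAttribute<ITypeAttribute>();

            stream.Reset(); // required before the first IncrementToken() call
            while (stream.IncrementToken())
            {
                // Bigrams carry DOUBLE_TYPE and unigrams carry SINGLE_TYPE;
                // non-CJK tokens keep whatever type the tokenizer assigned.
                bool isBigram = type.Type == CJKBigramFilter.DOUBLE_TYPE;
                Console.WriteLine($"{term} [{type.Type}] bigram={isBigram}");
            }
            stream.End(); // finalizes end-of-stream state
        }
    }
}
```

Passing `false` as the last argument, or using one of the shorter constructors, should restore the default behavior, in which a unigram is emitted only for a CJK character that has no neighbor to pair with.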
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public sealed class CJKBigramFilter : TokenFilter, IDisposable
Constructors
Name | Description |
---|---|
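CJKBigramFilter(TokenStream) | Create a new CJKBigramFilter that forms bigrams for all CJK scripts. |
CJKBigramFilter(TokenStream, CJKScript) | Create a new CJKBigramFilter, specifying which writing systems should be bigrammed. Unigrams are output only for characters that have no adjacent character to form a bigram. |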
CJKBigramFilter(TokenStream, CJKScript, Boolean) | Create a new CJKBigramFilter, specifying which writing systems should be bigrammed, and whether or not unigrams should also be output. |
Fields
Name | Description |
---|---|
DOUBLE_TYPE | When a bigram is emitted, it is marked as this type. |
SINGLE_TYPE | When a unigram is emitted, it is marked as this type. |
Methods
Name | Description |
---|---|
IncrementToken() | |
Reset() | |