Class CommonGramsFilter
Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. This is achieved through the use of position increments. Bigrams have a type of GRAM_TYPE.
Example:
- input: "the quick brown fox"
- output: |"the","the-quick"|"quick"|"brown"|"fox"|
- "the-quick" has a position increment of 0, so it is in the same position as "the".
- "the-quick" has a term.type() of "gram".
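
The overlay described above can be observed by running the filter over the example input and printing each token's attributes. The following is a minimal sketch, assuming Lucene.Net 4.8 with the Lucene.Net.Analysis.Common package; the class name `CommonGramsDemo`, the choice of `WhitespaceTokenizer`, and the single-word common set are illustrative assumptions, not part of this page. The separator character that the filter places between the two words of a bigram is determined by the implementation and may differ from the hyphen used in the example notation.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CommonGrams;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical demo class; not part of Lucene.Net.
public static class CommonGramsDemo
{
    public static void Main()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // Assumption: "the" is the only common word; true = ignore case.
        var commonWords = new CharArraySet(version, new[] { "the" }, true);

        using (TokenStream stream = new CommonGramsFilter(
            version,
            new WhitespaceTokenizer(version, new StringReader("the quick brown fox")),
            commonWords))
        {
            // Attributes expose the term text, its position increment, and its type.
            var term = stream.AddAttribute<ICharTermAttribute>();
            var posInc = stream.AddAttribute<IPositionIncrementAttribute>();
            var type = stream.AddAttribute<ITypeAttribute>();

            stream.Reset(); // must be called before the first IncrementToken()
            while (stream.IncrementToken())
            {
                // Bigrams are expected to report posInc=0 and type=gram.
                Console.WriteLine($"{term} posInc={posInc.PositionIncrement} type={type.Type}");
            }
            stream.End();
        }
    }
}
```

Tokens printed with a position increment of 0 share a position with the unigram emitted just before them, which is what lets phrase queries match either form.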
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public sealed class CommonGramsFilter : TokenFilter, IDisposable
Constructors
Name | Description |
---|---|
CommonGramsFilter(LuceneVersion, TokenStream, CharArraySet) | Constructs a token stream that filters the given input, using a set of common words to create bigrams. Outputs unigrams with their own position increment, plus bigrams with position increment 0 and type="gram" wherever one or both of the words in a potential bigram are in the set of common words. |
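
The constructor is typically called inside an analysis chain, where the `TokenStream` argument is the output of a tokenizer or an earlier filter. Below is a hedged sketch of one way to wire it up, assuming Lucene.Net 4.8 and its `Analyzer.NewAnonymous` helper; the factory class `CommonGramsAnalyzerFactory` and its `Create` method are hypothetical names introduced for illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CommonGrams;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical helper; not part of Lucene.Net.
public static class CommonGramsAnalyzerFactory
{
    public static Analyzer Create(params string[] commonWords)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // Build the common-word set once; true = case-insensitive matching.
        var words = new CharArraySet(version, commonWords, true);

        return Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
        {
            // Assumption: whitespace tokenization is sufficient for the example.
            Tokenizer source = new WhitespaceTokenizer(version, reader);
            TokenStream result = new CommonGramsFilter(version, source, words);
            return new TokenStreamComponents(source, result);
        });
    }
}
```

A call such as `CommonGramsAnalyzerFactory.Create("the", "of", "a")` would then yield an analyzer suitable for indexing; the query side usually needs a matching chain so that bigrams are looked up consistently.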
Fields
Name | Description |
---|---|
GRAM_TYPE | The token type assigned to bigram tokens ("gram"). |
Methods
Name | Description |
---|---|
IncrementToken() | Inserts bigrams for common words into a token stream. For each input token, outputs the token. If the token and/or the following token are in the list of common words, also outputs a bigram with position increment 0 and type="gram". TODO: Consider adding an option to not emit unigram stopwords, as in CDL XTF BigramStopFilter; CommonGramsQueryFilter would need to be changed to work with this. TODO: Consider optimizing for the case of three commongrams, i.e. "man of the year" normally produces 3 bigrams: "man-of", "of-the", "the-year", but with proper management of positions we could eliminate the middle bigram "of-the" and save a disk seek and a whole set of position lookups. |
Reset() | This method is called by a consumer before it begins consumption using IncrementToken(). Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call |
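
The bigram insertion rule described for IncrementToken() can be modeled without Lucene.Net at all. The sketch below is a simplified stand-alone model, not the library's implementation: it works on a pre-split list of strings, uses the hyphen from this page's notation as the separator, and labels unigrams with a placeholder type of "word". The names `CommonGramsSketch` and `Overlay` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Simplified model of the rule; not the Lucene.Net implementation.
public static class CommonGramsSketch
{
    // Returns (term, position increment, type) triples for the given tokens.
    public static List<(string Term, int PositionIncrement, string Type)> Overlay(
        IReadOnlyList<string> tokens, ISet<string> commonWords)
    {
        var output = new List<(string Term, int PositionIncrement, string Type)>();
        for (int i = 0; i < tokens.Count; i++)
        {
            // Every input token is emitted as a unigram at its own position.
            output.Add((tokens[i], 1, "word"));

            // If this token and/or the next one is common, overlay a bigram
            // at the same position (position increment 0, type "gram").
            bool hasNext = i + 1 < tokens.Count;
            if (hasNext &&
                (commonWords.Contains(tokens[i]) || commonWords.Contains(tokens[i + 1])))
            {
                output.Add((tokens[i] + "-" + tokens[i + 1], 0, "gram"));
            }
        }
        return output;
    }

    public static void Main()
    {
        var common = new HashSet<string> { "of", "the" };
        foreach (var t in Overlay(new[] { "man", "of", "the", "year" }, common))
        {
            Console.WriteLine($"{t.Term} posInc={t.PositionIncrement} type={t.Type}");
        }
    }
}
```

For "man of the year" with "of" and "the" as common words, this model emits the four unigrams plus the three bigrams "man-of", "of-the", and "the-year" named in the TODO note, each bigram sharing a position with the unigram before it.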