Class WordDelimiterFilter
Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric characters):
"Wi-Fi" → "Wi", "Fi"
- split on case transitions:
"PowerShot" → "Power", "Shot"
- split on letter-number transitions:
"SD500" → "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored:
"//hello---there, 'dude'" → "hello", "there", "dude"
- trailing "'s" are removed for each subword:
"O'Neil's" → "O", "Neil"
- Note: this step isn't performed in a separate filter because of possible subword combinations.
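The splitting rules above can be tried out with a short console program. This is a minimal sketch, assuming Lucene.NET 4.8 and the `WordDelimiterFlags` values `GENERATE_WORD_PARTS`, `GENERATE_NUMBER_PARTS`, `SPLIT_ON_CASE_CHANGE`, `SPLIT_ON_NUMERICS`, and `STEM_ENGLISH_POSSESSIVE`, none of which are listed on this page; the class name `SplitRulesDemo` and the sample input are illustrative only.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class SplitRulesDemo
{
    public static void Main()
    {
        // Assumption: Lucene.NET 4.8 is the referenced package version.
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // Assumed flag names; they enable the splitting rules described above.
        WordDelimiterFlags flags =
            WordDelimiterFlags.GENERATE_WORD_PARTS |
            WordDelimiterFlags.GENERATE_NUMBER_PARTS |
            WordDelimiterFlags.SPLIT_ON_CASE_CHANGE |
            WordDelimiterFlags.SPLIT_ON_NUMERICS |
            WordDelimiterFlags.STEM_ENGLISH_POSSESSIVE;

        string text = "Wi-Fi PowerShot SD500 O'Neil's";

        // WhitespaceTokenizer keeps intra-word delimiters so the filter can see them.
        Tokenizer tokenizer = new WhitespaceTokenizer(version, new StringReader(text));

        // A null protected-word set means no token is exempt from splitting.
        using (TokenStream filter = new WordDelimiterFilter(version, tokenizer, flags, null))
        {
            var term = filter.AddAttribute<ICharTermAttribute>();
            var posInc = filter.AddAttribute<IPositionIncrementAttribute>();

            filter.Reset();
            int position = -1;
            while (filter.IncrementToken())
            {
                // Accumulate position increments to get absolute token positions.
                position += posInc.PositionIncrement;
                Console.WriteLine($"{position}:\"{term}\"");
            }
            filter.End();
        }
    }
}
```

Each printed line pairs a token position with a subword, which makes it easy to compare the actual output of the installed version against the examples listed above.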
The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations:
"PowerShot" → 0:"Power", 1:"Shot"
(0 and 1 are the token positions)
- combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position as the last subword in the run:
"PowerShot" → 0:"Power", 1:"Shot", 1:"PowerShot"
"A's+B's&C's" → 0:"A", 1:"B", 2:"C", 2:"ABC"
"Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
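The constructors documented on this page take a `WordDelimiterFlags` value rather than a combinations parameter. The sketch below assumes that the `CATENATE_WORDS` flag plays the role of combinations="1" and that omitting it corresponds to combinations="0"; that mapping, the flag names, and the helper `Analyze` are assumptions for illustration, not something this page states. The exact positions reported for catenated tokens may differ between versions, so the program prints them rather than asserting them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class CombinationsDemo
{
    // Hypothetical helper: runs one input through the filter and
    // returns each token as a "position:term" string.
    private static List<string> Analyze(string text, WordDelimiterFlags flags)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        var result = new List<string>();

        Tokenizer tokenizer = new WhitespaceTokenizer(version, new StringReader(text));
        using (TokenStream filter = new WordDelimiterFilter(version, tokenizer, flags, null))
        {
            var term = filter.AddAttribute<ICharTermAttribute>();
            var posInc = filter.AddAttribute<IPositionIncrementAttribute>();

            filter.Reset();
            int position = -1;
            while (filter.IncrementToken())
            {
                position += posInc.PositionIncrement;
                result.Add($"{position}:\"{term}\"");
            }
            filter.End();
        }
        return result;
    }

    public static void Main()
    {
        // Splitting only: assumed equivalent of combinations="0".
        WordDelimiterFlags split =
            WordDelimiterFlags.GENERATE_WORD_PARTS |
            WordDelimiterFlags.GENERATE_NUMBER_PARTS |
            WordDelimiterFlags.SPLIT_ON_CASE_CHANGE |
            WordDelimiterFlags.SPLIT_ON_NUMERICS;

        // Splitting plus catenated word runs: assumed equivalent of combinations="1".
        WordDelimiterFlags splitAndCatenate = split | WordDelimiterFlags.CATENATE_WORDS;

        Console.WriteLine(string.Join(", ", Analyze("PowerShot", split)));
        Console.WriteLine(string.Join(", ", Analyze("PowerShot", splitAndCatenate)));
    }
}
```

Comparing the two printed lines shows which extra tokens the catenation flag contributes and at which positions they are placed.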
One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contains "wi-fi", one may want queries for "wifi", "WiFi", "wi-fi", and "wi+fi" to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing and combinations="0" (the default) in the analyzer used for querying. Because the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).
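The index-time and query-time setup described above might look like the following. This is a sketch under several assumptions: Lucene.NET 4.8, the `Analyzer.NewAnonymous` factory, the `CATENATE_WORDS` flag standing in for combinations="1", and a `LowerCaseFilter` appended so that "WiFi" and "wifi" compare equal (lowercasing is an addition of this example, not something the text above prescribes). The names `WifiAnalyzers`, `Build`, `IndexAnalyzer`, and `QueryAnalyzer` are hypothetical.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

public static class WifiAnalyzers
{
    // Assumption: Lucene.NET 4.8 is the referenced package version.
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    // Flags shared by both analyzers: split into subwords, no catenation.
    private const WordDelimiterFlags Split =
        WordDelimiterFlags.GENERATE_WORD_PARTS |
        WordDelimiterFlags.GENERATE_NUMBER_PARTS |
        WordDelimiterFlags.SPLIT_ON_CASE_CHANGE |
        WordDelimiterFlags.SPLIT_ON_NUMERICS;

    private static Analyzer Build(WordDelimiterFlags flags)
    {
        return Analyzer.NewAnonymous((fieldName, reader) =>
        {
            // WhitespaceTokenizer leaves "-" and "+" in place for the filter to split on.
            Tokenizer tokenizer = new WhitespaceTokenizer(MatchVersion, reader);
            TokenStream stream = new WordDelimiterFilter(MatchVersion, tokenizer, flags, null);
            // Lowercase after splitting so case transitions are still visible to the filter.
            stream = new LowerCaseFilter(MatchVersion, stream);
            return new TokenStreamComponents(tokenizer, stream);
        });
    }

    // Indexing: subwords plus catenated runs, so "wi-fi" also yields "wifi".
    public static Analyzer IndexAnalyzer()
    {
        return Build(Split | WordDelimiterFlags.CATENATE_WORDS);
    }

    // Querying: subwords only.
    public static Analyzer QueryAnalyzer()
    {
        return Build(Split);
    }
}
```

With this arrangement the index side carries both the parts and the joined form, while the query side produces only parts or a single unsplit word, which is the asymmetry the paragraph above relies on.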
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public sealed class WordDelimiterFilter : TokenFilter, IDisposable
Constructors
Name | Description |
---|---|
WordDelimiterFilter(LuceneVersion, TokenStream, WordDelimiterFlags, CharArraySet) | Creates a new WordDelimiterFilter using DEFAULT_WORD_DELIM_TABLE as its charTypeTable |
WordDelimiterFilter(LuceneVersion, TokenStream, Byte[], WordDelimiterFlags, CharArraySet) | Creates a new WordDelimiterFilter |
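The second constructor accepts a custom character type table. The sketch below builds a 256-entry table from the type constants listed under Fields and marks the hyphen as alphabetic so that it no longer acts as a delimiter. It assumes Lucene.NET 4.8, that the table is indexed by character code, that the constants fit in a byte, and that characters beyond the table fall back to the filter's built-in classification; the flag names and the names `CustomTableDemo` and `BuildTable` are illustrative assumptions.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class CustomTableDemo
{
    // Hypothetical helper: classifies the first 256 characters,
    // then overrides '-' so it is treated as a letter.
    private static byte[] BuildTable()
    {
        var table = new byte[256];
        for (int c = 0; c < table.Length; c++)
        {
            char ch = (char)c;
            if (char.IsLower(ch)) table[c] = (byte)WordDelimiterFilter.LOWER;
            else if (char.IsUpper(ch)) table[c] = (byte)WordDelimiterFilter.UPPER;
            else if (char.IsLetter(ch)) table[c] = (byte)WordDelimiterFilter.ALPHA;
            else if (char.IsDigit(ch)) table[c] = (byte)WordDelimiterFilter.DIGIT;
            else table[c] = (byte)WordDelimiterFilter.SUBWORD_DELIM;
        }
        table['-'] = (byte)WordDelimiterFilter.ALPHA;
        return table;
    }

    public static void Main()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        WordDelimiterFlags flags =
            WordDelimiterFlags.GENERATE_WORD_PARTS |
            WordDelimiterFlags.GENERATE_NUMBER_PARTS |
            WordDelimiterFlags.SPLIT_ON_CASE_CHANGE |
            WordDelimiterFlags.SPLIT_ON_NUMERICS;

        Tokenizer tokenizer = new WhitespaceTokenizer(version, new StringReader("Wi-Fi SD500"));
        using (TokenStream filter =
            new WordDelimiterFilter(version, tokenizer, BuildTable(), flags, null))
        {
            var term = filter.AddAttribute<ICharTermAttribute>();
            filter.Reset();
            while (filter.IncrementToken())
            {
                // Print each emitted term to inspect the effect of the custom table.
                Console.WriteLine(term.ToString());
            }
            filter.End();
        }
    }
}
```

Passing the table explicitly keeps the customization local to one filter instance; the four-argument constructor remains the simpler choice when the default table is sufficient.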
Fields
Name | Description |
---|---|
ALPHA | |
ALPHANUM | |
DIGIT | |
LOWER | |
SUBWORD_DELIM | |
UPPER | |
Methods
Name | Description |
---|---|
IncrementToken() | |
Reset() | |