Class PatternTokenizerFactory
Factory for PatternTokenizer. This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".
- "pattern" is the regular expression.
- "group" says which group to extract into tokens.
group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output (without empty tokens) of: System.Text.RegularExpressions.Regex.Split(System.String,System.String)
Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be bbb and ccc (without the ' marks).
NOTE: This Tokenizer does not output tokens that are of zero length.
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
</analyzer>
</fieldType>
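The "group" semantics described above can be illustrated with plain System.Text.RegularExpressions, without any Lucene.NET types. The following is a minimal sketch, not the library's implementation: the class name PatternTokenSketch and its Tokenize method are hypothetical helpers introduced only for this example, and the third sample input ("a, b,,c") is an assumed illustration of split mode.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, NOT part of Lucene.NET: mimics the documented
// "pattern"/"group" behavior using only the .NET regex classes.
public static class PatternTokenSketch
{
    public static List<string> Tokenize(string input, string pattern, int group)
    {
        var tokens = new List<string>();
        var regex = new Regex(pattern);

        if (group < 0)
        {
            // Split mode: each match is a delimiter. The matches are walked
            // by hand instead of calling Regex.Split, because Regex.Split
            // also returns captured group text when the pattern contains
            // capturing groups.
            int start = 0;
            foreach (Match m in regex.Matches(input))
            {
                if (m.Index > start)
                {
                    tokens.Add(input.Substring(start, m.Index - start));
                }
                start = m.Index + m.Length;
            }
            if (start < input.Length)
            {
                tokens.Add(input.Substring(start));
            }
        }
        else
        {
            // Group mode: the selected group of each match becomes a token.
            foreach (Match m in regex.Matches(input))
            {
                string value = m.Groups[group].Value;
                if (value.Length > 0) // zero-length tokens are never emitted
                {
                    tokens.Add(value);
                }
            }
        }
        return tokens;
    }

    public static void Main()
    {
        string input = "aaa 'bbb' 'ccc'";
        string pattern = "'([^']+)'";

        Console.WriteLine(string.Join("|", Tokenize(input, pattern, 0)));       // 'bbb'|'ccc'
        Console.WriteLine(string.Join("|", Tokenize(input, pattern, 1)));       // bbb|ccc
        Console.WriteLine(string.Join("|", Tokenize("a, b,,c", @",\s*", -1)));  // a|b|c
    }
}
```

In the last call, the two adjacent commas would produce an empty string between them in a naive split; the sketch drops it, matching the note that zero-length tokens are not output.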
Since: Solr 1.2
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Serializable]
public class PatternTokenizerFactory : TokenizerFactory
Constructors
Name | Description |
---|---|
PatternTokenizerFactory(IDictionary<String, String>) | Creates a new PatternTokenizerFactory |
Fields
Name | Description |
---|---|
GROUP | |
m_group | |
m_pattern | |
PATTERN | |
Methods
Name | Description |
---|---|
Create(AttributeSource.AttributeFactory, TextReader) | Splits the input using the configured pattern |