Class PatternAnalyzer
An efficient Lucene analyzer/tokenizer that preferably operates on a System.String rather than a System.IO.TextReader. It flexibly separates text into terms via a regular expression (System.Text.RegularExpressions.Regex), with behaviour similar to System.Text.RegularExpressions.Regex.Split(System.String), and it combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer and StopFilter into a single efficient multi-purpose class.
If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via System.Text.RegularExpressions.Regex.Split(System.String). Once you are satisfied, pass that regex to PatternAnalyzer. Also see Regular Expression Tutorial.
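The prototyping step described above can be sketched with nothing but the .NET base class library. The helper name `RegexPrototype.Terms` is hypothetical and not part of Lucene.NET; it simply wraps Regex.Split(String) and drops the empty strings that appear when the text starts or ends with a delimiter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPrototype
{
    // Hypothetical helper: splits the text with the candidate pattern,
    // as Regex.Split(String) does, and removes empty entries caused by
    // leading or trailing delimiters.
    public static string[] Terms(string pattern, string text)
    {
        string[] parts = new Regex(pattern).Split(text);
        return Array.FindAll(parts, part => part.Length > 0);
    }
}
```

For instance, `RegexPrototype.Terms(@"\s+", "James is running round in the woods")` returns the seven words of the sentence, and `RegexPrototype.Terms(@"[,;]\s*", "red, green;blue,  yellow")` returns red, green, blue and yellow. Note that Regex.Split also returns the text of any capturing groups in the pattern, so prototypes are easiest to read with non-capturing patterns.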
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene TokenFilter chain, as in this stemming example:
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
pat.GetTokenStream("content", "James is running round in the woods"),
"English");
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Obsolete("(4.0) use the pattern-based analysis in the analysis/pattern package instead.")]
[Serializable]
public sealed class PatternAnalyzer : Analyzer, IDisposable
Constructors
Name | Description |
---|---|
PatternAnalyzer(LuceneVersion, Regex, Boolean, CharArraySet) | Constructs a new instance with the given parameters. |
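As a minimal sketch of how the constructor might be used, the example below builds an analyzer and collects the terms it produces. It assumes Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package, that PatternAnalyzer lives in the Lucene.Net.Analysis.Miscellaneous namespace, and that StopAnalyzer.ENGLISH_STOP_WORDS_SET is an acceptable stop word set; the wrapper class `PatternAnalyzerDemo` and its `Analyze` method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class PatternAnalyzerDemo
{
    // Hypothetical helper: splits on whitespace, lower-cases the terms
    // and removes English stop words, then returns the remaining terms.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
#pragma warning disable CS0618 // PatternAnalyzer is marked [Obsolete]
        using (var analyzer = new PatternAnalyzer(
            LuceneVersion.LUCENE_48,              // assumed compatibility version
            new Regex(@"\s+"),                    // pattern that separates terms
            true,                                 // toLowerCase
            StopAnalyzer.ENGLISH_STOP_WORDS_SET)) // assumed stop word set
#pragma warning restore CS0618
        using (TokenStream stream = analyzer.GetTokenStream("content", text))
        {
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();                       // required before IncrementToken()
            while (stream.IncrementToken())
            {
                terms.Add(term.ToString());
            }
            stream.End();
        }
        return terms;
    }
}
```

Because the class is marked obsolete, the compiler warning is suppressed locally rather than project-wide, which keeps the deprecation visible everywhere else.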
Fields
Name | Description |
---|---|
DEFAULT_ANALYZER | A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader. |
EXTENDED_ANALYZER | A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html |
NON_WORD_PATTERN | "\W+"; divides text at runs of non-word characters. |
WHITESPACE_PATTERN | "\s+"; divides text at runs of whitespace. |
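To make the difference between the two pattern fields concrete, the sketch below assumes they are equivalent to the regular expressions `\W+` and `\s+` (as in the upstream Lucene source) and applies both to the same text using only System.Text.RegularExpressions. The class `PatternComparison` and its members are hypothetical stand-ins, not the Lucene.NET fields themselves.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Assumed equivalents of NON_WORD_PATTERN and WHITESPACE_PATTERN.
    public static readonly Regex NonWord = new Regex(@"\W+");
    public static readonly Regex Whitespace = new Regex(@"\s+");

    // Splits the text with the given pattern and drops empty entries.
    public static string[] Split(Regex pattern, string text)
    {
        return Array.FindAll(pattern.Split(text), part => part.Length > 0);
    }
}
```

On the text "don't stop-me now", splitting with `Whitespace` keeps punctuation inside the terms (don't, stop-me, now), while `NonWord` also breaks at the apostrophe and the hyphen (don, t, stop, me, now). The whitespace pattern is therefore the safer choice when terms such as identifiers or hyphenated words must survive intact.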
Methods
Name | Description |
---|---|
CreateComponents(String, TextReader) | Creates a token stream that tokenizes all the text in the given reader. This implementation forwards to CreateComponents(String, TextReader, String) and is less efficient than calling that overload with the text supplied directly as a string. |
CreateComponents(String, TextReader, String) | Creates a token stream that tokenizes the given string into token terms (aka words). |
Equals(Object) | Indicates whether some other object is "equal to" this one. |
GetHashCode() | Returns a hash code value for the object. |