Class ArabicLetterTokenizer
Tokenizer that breaks text into runs of letters and diacritics.
The standard LetterTokenizer fails on diacritics: because they are not letters, it breaks a word at every diacritic. Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.
You must specify the required LuceneVersion compatibility when creating ArabicLetterTokenizer:
- As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
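The int-based API matters because a single UTF-16 `char` cannot represent a supplementary character. The following is a minimal sketch of that point using only the .NET base class library; `CodePointSketch` and `CategoryOf` are hypothetical names for illustration and are not part of Lucene.Net.

```csharp
using System;
using System.Globalization;

public static class CodePointSketch
{
    // Returns the Unicode general category of a full code point,
    // including supplementary code points above U+FFFF.
    // Assumes codePoint is a valid scalar value (not a lone surrogate);
    // char.ConvertFromUtf32 throws otherwise.
    public static UnicodeCategory CategoryOf(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        // The (string, int) overload reads a surrogate pair as one character.
        return CharUnicodeInfo.GetUnicodeCategory(s, 0);
    }
}
```

For example, U+10330 (GOTHIC LETTER AHSA) is stored as two UTF-16 code units. `char.IsLetter` applied to the first code unit alone returns false because it only sees a high surrogate, whereas `CodePointSketch.CategoryOf(0x10330)` reports `UnicodeCategory.OtherLetter`.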
Inheritance
System.Object
AttributeSource
TokenStream
Tokenizer
CharTokenizer
LetterTokenizer
ArabicLetterTokenizer
Inherited Members
Lucene.Net.Analysis.Tokenizer.SetReader(System.IO.TextReader)
System.Object.Equals(System.Object, System.Object)
System.Object.ReferenceEquals(System.Object, System.Object)
System.Object.GetType()
System.Object.MemberwiseClone()
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Obsolete("(3.1) Use StandardTokenizer instead.")]
[Serializable]
public class ArabicLetterTokenizer : LetterTokenizer, IDisposable
Constructors
Name | Description |
---|---|
ArabicLetterTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader) | Construct a new ArabicLetterTokenizer using a given AttributeSource.AttributeFactory. |
ArabicLetterTokenizer(LuceneVersion, TextReader) | Construct a new ArabicLetterTokenizer. |
Methods
Name | Description |
---|---|
IsTokenChar(Int32) | Returns true for a character in a Letter category or in the NonSpacingMark category. |
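
The rule described for IsTokenChar(Int32) can be sketched without Lucene.Net at all. The code below is an illustration under stated assumptions, not the library's implementation: `LetterOrMarkSketch`, `IsLetter`, `IsLetterOrMark`, and `Split` are hypothetical names, and `Split` is a simplified stand-in for what a character tokenizer does with such a predicate.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class LetterOrMarkSketch
{
    // Hypothetical stand-in for a letter-only rule: any Unicode letter category.
    public static bool IsLetter(int codePoint)
    {
        switch (CategoryOf(codePoint))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
                return true;
            default:
                return false;
        }
    }

    // Hypothetical stand-in for the documented rule: letters plus nonspacing marks.
    public static bool IsLetterOrMark(int codePoint)
    {
        return IsLetter(codePoint)
            || CategoryOf(codePoint) == UnicodeCategory.NonSpacingMark;
    }

    // Splits text into maximal runs of code points accepted by the predicate.
    public static List<string> Split(string text, Func<int, bool> isTokenChar)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        int i = 0;
        while (i < text.Length)
        {
            // Read one code point: two UTF-16 units for a surrogate pair, else one.
            bool pair = char.IsSurrogatePair(text, i);
            int codePoint = pair ? char.ConvertToUtf32(text, i) : text[i];
            int width = pair ? 2 : 1;

            if (isTokenChar(codePoint))
            {
                current.Append(text, i, width);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
            i += width;
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }

    private static UnicodeCategory CategoryOf(int codePoint)
    {
        // A lone surrogate cannot be converted to a string; report it as such.
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        {
            return UnicodeCategory.Surrogate;
        }
        return CharUnicodeInfo.GetUnicodeCategory(char.ConvertFromUtf32(codePoint), 0);
    }
}
```

With the vocalized word `"\u0643\u0650\u062A\u064E\u0627\u0628"` (kaf, kasra, ta, fatha, alef, ba), `Split(word, LetterOrMarkSketch.IsLetter)` breaks at each diacritic and yields three fragments, while `Split(word, LetterOrMarkSketch.IsLetterOrMark)` keeps the word as a single token.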