Class DocTermOrds
This class enables fast access to multiple term ords for a specified field across all docIDs.
Like IFieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike IFieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the GetOrdTermsEnum(AtomicReader) method and then seek by ord to get the term's bytes.
While normally term ords are type
Deleted documents are skipped during uninversion, and if you look them up you'll get 0 ords.
The returned per-document ords do not retain their original order in the document. Instead, they are returned in sorted order (by ord, i.e. by the term's BytesRef comparer). They are also de-duplicated: if a document has the same term more than once in this field, you get that ord back only once.
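The sorting and de-duplication described above can be illustrated with a small, self-contained sketch. It does not use the library: SortedOrdsSketch and OrdsForDocument are hypothetical names, plain strings stand in for BytesRef terms, and ordinal string comparison is assumed as a stand-in for the BytesRef comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; not part of the library.
public static class SortedOrdsSketch
{
    // Returns the ords a document would report for one field:
    // sorted by ord, with repeated terms collapsed to a single ord.
    public static int[] OrdsForDocument(string[] fieldValuesInDocOrder, string[] sortedTermDictionary)
    {
        var ords = new SortedSet<int>();
        foreach (string value in fieldValuesInDocOrder)
        {
            // The ord of a term is its position in the sorted term dictionary.
            int ord = Array.BinarySearch(sortedTermDictionary, value, StringComparer.Ordinal);
            if (ord >= 0)
            {
                ords.Add(ord); // SortedSet keeps ords ordered and drops duplicates
            }
        }
        return ords.ToArray();
    }
}
```

For a dictionary of "apple", "kiwi", "mango", "zebra" and a document whose field values appear as "zebra", "apple", "zebra", "mango", this sketch reports the ords 0, 2, 3: the document order is lost and "zebra" is reported once.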
This class tests whether the provided reader is able to retrieve terms by ord (i.e. it is a single segment and uses an ord-capable terms index). If not, this class creates its own term index internally, which lets it build a wrapped TermsEnum that can handle ord. The GetOrdTermsEnum(AtomicReader) method then provides this wrapped enum, if necessary.
The RAM consumption of this class can be high!
@lucene.experimental
Inheritance
Assembly: DistributedLucene.Net.dll
Syntax
public class DocTermOrds : object
Remarks
Final form of the un-inverted field:
- Each document points to a list of term numbers that are contained in that document.
- Term numbers are in sorted order, and are encoded as variable-length deltas from the previous term number. Real term numbers start at 2 since 0 and 1 are reserved. A term number of 0 signals the end of the termNumber list.
- There is a single int[maxDoc()] which either contains a pointer into a byte[] for the termNumber lists, or directly contains the termNumber list if it fits in the 4 bytes of an integer. If the first byte in the integer is 1, the next 3 bytes are a pointer into a byte[] where the termNumber list starts.
- There are actually 256 byte arrays, to compensate for the fact that the pointers into the byte arrays are only 3 bytes long. The correct byte array for a document is a function of its id.
- To save space and speed up faceting, any term that matches enough documents will not be un-inverted... it will be skipped while building the un-inverted field structure, and will use a set intersection method during faceting.
- To further save memory, the terms (the actual string values) are not all stored in memory, but a TermIndex is used to convert term numbers to term values only for the terms needed after faceting has completed. Only every 128th term value is stored, along with its corresponding term number, and this is used as an index to find the closest term and iterate until the desired number is hit (very much like Lucene's own internal term index).
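The term-number list encoding described in the list above can be sketched as follows. This is a simplified illustration, not the library's code: TermNumberListSketch and all of its members are hypothetical names, and the byte layout (7-bit groups, most significant group first, packed into the int low byte first) is an assumption made for the sketch. The pointer case and the 256 byte arrays are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of the library.
public static class TermNumberListSketch
{
    // Added to every delta so that the values 0 (end of list) and 1 (pointer marker) stay reserved.
    private const int TnumOffset = 2;

    // Encodes sorted, de-duplicated term numbers as variable-length deltas, ending with a 0 byte.
    public static byte[] Encode(IEnumerable<int> sortedTermNumbers)
    {
        var bytes = new List<byte>();
        int previous = 0;
        foreach (int tnum in sortedTermNumbers)
        {
            WriteVInt(bytes, tnum - previous + TnumOffset);
            previous = tnum;
        }
        bytes.Add(0); // terminator
        return bytes.ToArray();
    }

    // Writes a non-negative value as 7-bit groups, most significant group first;
    // the high bit of a byte signals that another group follows.
    private static void WriteVInt(List<byte> output, int value)
    {
        int shift = 28;
        while (shift > 0 && (value >> shift) == 0) shift -= 7; // skip empty leading groups
        for (; shift > 0; shift -= 7)
        {
            output.Add((byte)(((value >> shift) & 0x7F) | 0x80));
        }
        output.Add((byte)(value & 0x7F));
    }

    // Decodes a list produced by Encode back into absolute term numbers.
    public static List<int> Decode(byte[] encoded)
    {
        var result = new List<int>();
        int tnum = 0, delta = 0;
        foreach (byte b in encoded)
        {
            delta = (delta << 7) | (b & 0x7F);
            if ((b & 0x80) != 0) continue; // more groups of this delta follow
            if (delta == 0) break;         // 0 signals the end of the list
            tnum += delta - TnumOffset;
            result.Add(tnum);
            delta = 0;
        }
        return result;
    }

    // Stores the list directly in one int when its payload fits in 4 bytes.
    // In this sketch the low byte is never 1, so 1 remains free to mark a pointer.
    public static bool TryPackInline(byte[] encoded, out int packed)
    {
        packed = 0;
        int payload = encoded.Length - 1; // the trailing terminator is not stored
        if (payload > 4) return false;
        for (int i = 0; i < payload; i++)
        {
            packed |= encoded[i] << (8 * i);
        }
        return true;
    }

    // Reads a list back out of an int produced by TryPackInline.
    public static List<int> DecodeInline(int packed)
    {
        var bytes = new byte[5]; // 4 payload bytes plus a zero terminator
        for (int i = 0; i < 4; i++)
        {
            bytes[i] = unchecked((byte)(packed >> (8 * i)));
        }
        return Decode(bytes);
    }
}
```

In this sketch, a document with term numbers 0, 3 and 200 encodes to four payload bytes and therefore fits inline, while adding a fourth term number such as 100000 pushes the list past 4 bytes, which is the case where the real structure would store a pointer instead.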
Constructors
Name | Description |
---|---|
DocTermOrds(AtomicReader, IBits, String) | Inverts all terms |
DocTermOrds(AtomicReader, IBits, String, BytesRef) | Inverts only terms starting w/ prefix |
DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32) | Inverts only terms starting w/ prefix, and only terms whose docFreq (not taking deletions into account) is <= |
DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32, Int32) | Inverts only terms starting w/ prefix, and only terms whose docFreq (not taking deletions into account) is <= |
DocTermOrds(String, Int32, Int32) | Subclasses initialize with this constructor; be sure to then call Uninvert, and only once |
Fields
Name | Description |
---|---|
DEFAULT_INDEX_INTERVAL_BITS | Every 128th term is indexed, by default. |
m_docsEnum | Used while uninverting. |
m_field | Field we are uninverting. |
m_index | Holds the per-document ords or a pointer to the ords. |
m_indexedTermsArray | Holds the indexed (by default every 128th) terms. |
m_maxTermDocFreq | Don't uninvert terms that exceed this count. |
m_numTermsInField | Number of terms in the field. |
m_ordBase | Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement Ord. |
m_phase1_time | Time for phase1 of the uninvert process. |
m_prefix | If non-null, only terms matching this prefix were indexed. |
m_sizeOfIndexedStrings | Total bytes (sum of term lengths) for all indexed terms. |
m_termInstances | Total number of references to term numbers. |
m_tnums | Holds term ords for documents. |
m_total_time | Total time to uninvert the field. |
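The role of DEFAULT_INDEX_INTERVAL_BITS and m_indexedTermsArray can be illustrated with a small sketch of a sparse term index: keep every 128th term, jump to the nearest kept term at or before the wanted ord, then step forward. This is an assumption-laden illustration, not the library's implementation: SparseTermIndexSketch and its members are hypothetical names, a sorted in-memory list stands in for the terms dictionary, and an interval of 7 bits (1 << 7 = 128) is assumed from the "every 128th term" description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of the library.
public sealed class SparseTermIndexSketch
{
    private const int IndexIntervalBits = 7;                            // 1 << 7 = 128
    private const int IndexIntervalMask = (1 << IndexIntervalBits) - 1; // 127

    private readonly List<string> termDictionary; // stands in for the full terms dictionary
    private readonly string[] indexedTerms;       // every 128th term, the only part meant to live in RAM

    // Expects unique terms already sorted with ordinal comparison.
    public SparseTermIndexSketch(IEnumerable<string> sortedTerms)
    {
        termDictionary = new List<string>(sortedTerms);
        int count = (termDictionary.Count + IndexIntervalMask) >> IndexIntervalBits;
        indexedTerms = new string[count];
        for (int i = 0; i < count; i++)
        {
            indexedTerms[i] = termDictionary[i << IndexIntervalBits];
        }
    }

    public int IndexedTermCount
    {
        get { return indexedTerms.Length; }
    }

    // Resolves an ord to its term via the sparse index.
    public string LookupTerm(int ord)
    {
        if (ord < 0 || ord >= termDictionary.Count)
        {
            throw new ArgumentOutOfRangeException("ord");
        }
        // Closest indexed term at or before the requested ord.
        string anchor = indexedTerms[ord >> IndexIntervalBits];
        // Seek the dictionary to that term by its text...
        int position = termDictionary.BinarySearch(anchor, StringComparer.Ordinal);
        // ...then step forward over the remaining (ord mod 128) terms.
        position += ord & IndexIntervalMask;
        return termDictionary[position];
    }
}
```

With 300 terms, this sketch keeps only 3 indexed terms (ords 0, 128 and 256); looking up ord 257 seeks to the term at ord 256 and advances one step.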
Properties
Name | Description |
---|---|
IsEmpty | Returns |
NumTerms | Returns the number of terms in this field |
Methods
Name | Description |
---|---|
GetIterator(AtomicReader) | Returns a SortedSetDocValues view of this instance |
GetOrdTermsEnum(AtomicReader) | Returns a TermsEnum that implements Ord. If the provided NOTE: you must pass the same reader that was used when creating this class |
LookupTerm(TermsEnum, Int32) | Returns the term (BytesRef) corresponding to the provided ordinal. |
RamUsedInBytes() | Returns total bytes used. |
SetActualDocFreq(Int32, Int32) | Invoked during Uninvert(AtomicReader, IBits, BytesRef) to record the document frequency for each uninverted term. |
Uninvert(AtomicReader, IBits, BytesRef) | Call this only once (if you subclass!) |
VisitTerm(TermsEnum, Int32) | Subclass can override this |