Class DocTermOrds

This class enables fast access to multiple term ords for a specified field across all docIDs.

Like IFieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike IFieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the GetOrdTermsEnum(AtomicReader) method and then seek by ord to get the term's bytes.

While term ords are normally of type Int64, in this API they are Int32, because the internal representation here cannot address more than MAX_INT32 unique terms. This class is also typically used on fields with relatively few unique terms compared with the number of documents. In addition, there is an internal limit (16 MB) on how many bytes each chunk of documents may consume. If you trip this limit you'll hit an InvalidOperationException.

Deleted documents are skipped during uninversion, and if you look one up you'll get no ords back.

The returned per-document ords do not retain their original order in the document. Instead they are returned in sorted order (by ord, i.e. by the term's BytesRef comparer). They are also de-duplicated: if a document has the same term more than once in this field, you'll only get that ord back once.
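The ordering contract above can be pictured with a small stand-alone sketch. It does not call DocTermOrds; `OrdOrderSketch` and `SortedDistinctOrds` are hypothetical names introduced only for illustration, and the input is assumed to be the ords of one document's terms in the order they occurred.

```csharp
using System;
using System.Collections.Generic;

public static class OrdOrderSketch
{
    // Hypothetical helper: mimics the documented contract by returning
    // the ords sorted ascending with duplicates removed.
    public static int[] SortedDistinctOrds(IEnumerable<int> ordsInDocumentOrder)
    {
        // SortedSet both orders the values and drops repeats.
        var set = new SortedSet<int>(ordsInDocumentOrder);
        var result = new int[set.Count];
        set.CopyTo(result);
        return result;
    }
}
```

For a document whose terms occurred with ords 7, 3, 7, 5, this helper yields 3, 5, 7: the original order and the repeated 7 are both lost.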

This class tests whether the provided reader is able to retrieve terms by ord (i.e., it is a single segment and uses an ord-capable terms index). If not, this class creates its own term index internally, which lets it build a wrapped TermsEnum that can handle ord. The GetOrdTermsEnum(AtomicReader) method then provides this wrapped enum, if necessary.

The RAM consumption of this class can be high!

@lucene.experimental

Inheritance

System.Object

DocTermOrds

Assembly: DistributedLucene.Net.dll

Syntax

public class DocTermOrds : object

Remarks

Final form of the un-inverted field:

Each document points to a list of term numbers that are contained in that document.
Term numbers are in sorted order, and are encoded as variable-length deltas from the previous term number. Real term numbers start at 2 since 0 and 1 are reserved. A term number of 0 signals the end of the termNumber list.
There is a single int[maxDoc()] which either contains a pointer into a byte[] for the termNumber lists, or directly contains the termNumber list if it fits in the 4 bytes of an integer. If the first byte in the integer is 1, the next 3 bytes are a pointer into a byte[] where the termNumber list starts.
There are actually 256 byte arrays, to compensate for the fact that the pointers into the byte arrays are only 3 bytes long. The correct byte array for a document is a function of its id.
To save space and speed up faceting, any term that matches enough documents will not be un-inverted: it will be skipped while building the un-inverted field structure, and will use a set intersection method during faceting.
To further save memory, the terms (the actual string values) are not all stored in memory, but a TermIndex is used to convert term numbers to term values only for the terms needed after faceting has completed. Only every 128th term value is stored, along with its corresponding term number, and this is used as an index to find the closest term and iterate until the desired number is hit (very much like Lucene's own internal term index).
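The per-document encoding described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: the variable-length layout (7 payload bits per byte, most significant group first, high bit set on all but the last byte), the choice to place the first byte in the low 8 bits of the integer, and the use of a single byte pool instead of 256 arrays are all assumptions, and every name in the block is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TermNumberListSketch
{
    // 0 and 1 are reserved, so every stored delta is shifted up by 2.
    private const int TermNumberOffset = 2;

    // Encodes sorted, distinct, non-negative term numbers as variable-length deltas.
    public static byte[] EncodeDeltas(IEnumerable<int> sortedTermNumbers)
    {
        var bytes = new List<byte>();
        int previous = 0;
        foreach (int termNumber in sortedTermNumbers)
        {
            int delta = termNumber - previous + TermNumberOffset;
            previous = termNumber;
            // Emit the higher 7-bit groups first, each flagged with the high bit.
            for (int shift = 28; shift > 0; shift -= 7)
            {
                if ((delta >> shift) != 0)
                {
                    bytes.Add((byte)(0x80 | ((delta >> shift) & 0x7F)));
                }
            }
            bytes.Add((byte)(delta & 0x7F));
        }
        return bytes.ToArray();
    }

    // Builds the per-document int: the list itself when it fits in 4 bytes,
    // otherwise a marker byte of 1 plus a 3-byte offset into the byte pool.
    public static int Pack(byte[] encoded, List<byte> pool)
    {
        if (encoded.Length <= 4)
        {
            int inline = 0;
            for (int i = 0; i < encoded.Length; i++)
            {
                inline |= encoded[i] << (8 * i); // first byte lands in the low 8 bits
            }
            return inline;
        }
        int offset = pool.Count;
        if (offset > 0xFFFFFF)
        {
            throw new InvalidOperationException("offset does not fit in 3 bytes");
        }
        pool.AddRange(encoded);
        pool.Add(0); // a term number of 0 ends the list
        return (offset << 8) | 1;
    }

    // Reads the term numbers back from either representation.
    public static List<int> Decode(int code, IReadOnlyList<byte> pool)
    {
        bool isPointer = (code & 0xFF) == 1;
        uint bits = (uint)code;
        int pos = (int)(bits >> 8);
        var result = new List<int>();
        int termNumber = 0, delta = 0;
        while (true)
        {
            int b;
            if (isPointer) { b = pool[pos++]; }
            else { b = (int)(bits & 0xFF); bits >>= 8; }
            delta = (delta << 7) | (b & 0x7F);
            if ((b & 0x80) != 0) continue; // more groups follow for this delta
            if (delta == 0) break;         // terminator (or exhausted inline int)
            termNumber += delta - TermNumberOffset;
            result.Add(termNumber);
            delta = 0;
        }
        return result;
    }
}
```

In this sketch the list {0, 3, 10} encodes to three bytes and is stored inline, while {1, 2, 3, 4, 5, 6} needs six bytes and is stored in the pool behind a pointer. A 3-byte pointer can address 2^24 bytes (16 MB) per array, which appears consistent with the 16 MB chunk limit mentioned in the class summary.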

Constructors

Name	Description
DocTermOrds(AtomicReader, IBits, String)	Inverts all terms
DocTermOrds(AtomicReader, IBits, String, BytesRef)	Inverts only terms starting w/ prefix
DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32)	Inverts only terms starting w/ prefix, and only terms whose docFreq (not taking deletions into account) is <= `maxTermDocFreq`
DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32, Int32)	Inverts only terms starting w/ prefix, and only terms whose docFreq (not taking deletions into account) is <= `maxTermDocFreq`, with a custom indexing interval (default is every 128th term).
DocTermOrds(String, Int32, Int32)	Subclasses initialize with this constructor, but must then call Uninvert exactly once.

Fields

Name	Description
DEFAULT_INDEX_INTERVAL_BITS	Every 128th term is indexed, by default.
m_docsEnum	Used while uninverting.
m_field	Field we are uninverting.
m_index	Holds the per-document ords or a pointer to the ords.
m_indexedTermsArray	Holds the indexed (by default every 128th) terms.
m_maxTermDocFreq	Don't uninvert terms that exceed this count.
m_numTermsInField	Number of terms in the field.
m_ordBase	Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement Ord.
m_phase1_time	Time for phase1 of the uninvert process.
m_prefix	If non-null, only terms matching this prefix were indexed.
m_sizeOfIndexedStrings	Total bytes (sum of term lengths) for all indexed terms.
m_termInstances	Total number of references to term numbers.
m_tnums	Holds term ords for documents.
m_total_time	Total time to uninvert the field.
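The sparse term index behind DEFAULT_INDEX_INTERVAL_BITS and m_indexedTermsArray can be pictured with the sketch below. It is a simplified model, not the class's implementation: a sorted string array stands in for the terms dictionary, a binary search stands in for seeking a TermsEnum to a term, and the default of 7 interval bits is an assumption derived from 1 << 7 = 128. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class SparseTermIndexSketch
{
    private readonly string[] sortedTerms;  // stands in for the full terms dictionary
    private readonly string[] indexedTerms; // every (1 << intervalBits)-th term, held in memory
    private readonly int intervalBits;

    // sortedTerms must be distinct and sorted with ordinal comparison.
    public SparseTermIndexSketch(string[] sortedTerms, int intervalBits = 7)
    {
        this.sortedTerms = sortedTerms;
        this.intervalBits = intervalBits;
        int interval = 1 << intervalBits;
        var indexed = new List<string>();
        for (int ord = 0; ord < sortedTerms.Length; ord += interval)
        {
            indexed.Add(sortedTerms[ord]);
        }
        indexedTerms = indexed.ToArray();
    }

    public int IndexedTermCount => indexedTerms.Length;

    // Finds the closest indexed term at or before ord, seeks to it,
    // then steps forward until the requested ord is reached.
    public string LookupTerm(int ord)
    {
        if (ord < 0 || ord >= sortedTerms.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(ord));
        }
        string anchor = indexedTerms[ord >> intervalBits];
        // Stand-in for seeking the enum to the anchor term.
        int position = Array.BinarySearch(sortedTerms, anchor, StringComparer.Ordinal);
        int steps = ord & ((1 << intervalBits) - 1);
        for (int i = 0; i < steps; i++)
        {
            position++; // stand-in for advancing the enum by one term
        }
        return sortedTerms[position];
    }
}
```

With 300 terms and the default interval, only 3 terms are held in memory (ords 0, 128 and 256), and any lookup costs one seek plus at most 127 forward steps. That is the trade the field descriptions above imply: less RAM for indexed terms in exchange for a short scan per lookup.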

Properties

Name	Description
IsEmpty	Returns `true` if no terms were indexed.
NumTerms	Returns the number of terms in this field.

Methods

Name	Description
GetIterator(AtomicReader)	Returns a SortedSetDocValues view of this instance
GetOrdTermsEnum(AtomicReader)	Returns a TermsEnum that implements Ord. If the provided `reader` supports Ord, we just return its TermsEnum; if it does not, we build a "private" terms index internally (WARNING: consumes RAM) and use that index to implement Ord. This also enables Ord on top of a composite reader. The returned TermsEnum is unpositioned. This returns `null` if there are no terms. NOTE: you must pass the same reader that was used when creating this class.
LookupTerm(TermsEnum, Int32)	Returns the term (BytesRef) corresponding to the provided ordinal.
RamUsedInBytes()	Returns total bytes used.
SetActualDocFreq(Int32, Int32)	Invoked during Uninvert(AtomicReader, IBits, BytesRef) to record the document frequency for each uninverted term.
Uninvert(AtomicReader, IBits, BytesRef)	Call this only once (if you subclass!)
VisitTerm(TermsEnum, Int32)	Subclass can override this

Extension Methods

Number.IsNumber(Object)

SystemTypesHelpers.toString(Object)

SystemTypesHelpers.equals(Object, Object)