Namespace Lucene.Net.Codecs.Bloom
Classes
BloomFilterFactory
Class used to create index-time FuzzySet appropriately configured for each field. Also called to right-size bitsets for serialization.
@lucene.experimental
BloomFilteringPostingsFormat
A PostingsFormat useful for low doc-frequency fields such as primary keys. Bloom filters are maintained in a ".blm" file which offers "fast-fail" for reads in segments known to have no record of the key. A choice of delegate PostingsFormat is used to record all other Postings data.
A choice of BloomFilterFactory can be passed to tailor Bloom Filter settings on a per-field basis. The default configuration is DefaultBloomFilterFactory which allocates a ~8mb bitset and hashes values using MurmurHash2. This should be suitable for most purposes.
The format of the blm file is as follows:
- BloomFilter (.blm) --> Header, DelegatePostingsFormatName, NumFilteredFields, FilterNumFilteredFields, Footer
- Filter --> FieldNumber, FuzzySet
- FuzzySet -->See Serialize(DataOutput)
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- DelegatePostingsFormatName --> String (WriteString(String)) The name of a ServiceProvider registered PostingsFormat
- NumFilteredFields --> Uint32 (WriteInt32(Int32))
- FieldNumber --> Uint32 (WriteInt32(Int32)) The number of the field in this segment
- Footer --> CodecFooter (WriteFooter(IndexOutput))
@lucene.experimental
DefaultBloomFilterFactory
Default policy is to allocate a bitset with 10% saturation given a unique term per document. Bits are set via MurmurHash2 hashing function.
@lucene.experimental
FuzzySet
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.
The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter.
Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
This class is NOT threadsafe.
Internally a Bitset is used to record values and once a client has finished recording a stream of values the Downsize(Single) method can be used to create a suitably smaller set that is sized appropriately for the number of values recorded and desired saturation levels.
@lucene.experimental
HashFunction
Base class for hashing functions that can be referred to by name. Subclasses are expected to provide threadsafe implementations of the hash function on the range of bytes referenced in the provided BytesRef.
@lucene.experimental
MurmurHash2
This is a very fast, non-cryptographic hash suitable for general hash-based lookup. See http://murmurhash.googlepages.com/ for more details.
The C version of MurmurHash 2.0 found at that site was ported to Java by Andrzej Bialecki (ab at getopt org).
The code from getopt.org was adapted by Mark Harwood in the form here as one of a pluggable choice of
hashing functions as the core function had to be adapted to work with BytesRefs with offsets and lengths
rather than raw byte arrays.
@lucene.experimental
TestBloomFilteredLucene41Postings
A class used for testing BloomFilteringPostingsFormat with a concrete delegate (Lucene41). Creates a Bloom filter on ALL fields and with tiny amounts of memory reserved for the filter. DO NOT USE IN A PRODUCTION APPLICATION This is not a realistic application of Bloom Filters as they ordinarily are larger and operate on only primary key type fields.