Class FuzzySet
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.
The set is "lossy" in that it cannot definitively state that it does contain a value, but it can definitively say that a value is not in the set. It can therefore be used as a Bloom filter.
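The NO or MAYBE behaviour can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `TinyFuzzySet`, `Membership`, and the FNV-1a stand-in hash are names chosen for illustration and are not part of Lucene.Net, and the real class selects its own hash function and operates on `BytesRef` values rather than strings.

```csharp
using System;
using System.Collections;
using System.Text;

// Hypothetical illustration only; this is NOT the Lucene.Net FuzzySet implementation.
public enum Membership { No, Maybe }

public sealed class TinyFuzzySet
{
    private readonly BitArray bits;

    public TinyFuzzySet(int size)
    {
        bits = new BitArray(size);
    }

    // Stand-in hash (32-bit FNV-1a), assumed here purely for the example.
    private static uint Hash(byte[] data)
    {
        unchecked
        {
            uint h = 2166136261u;
            foreach (byte b in data)
            {
                h ^= b;
                h *= 16777619u;
            }
            return h;
        }
    }

    // Hash the bytes, then reduce modulo the size of the bit set.
    private int IndexOf(string value)
    {
        return (int)(Hash(Encoding.UTF8.GetBytes(value)) % (uint)bits.Length);
    }

    public void Add(string value)
    {
        bits[IndexOf(value)] = true;
    }

    // A clear bit proves absence; a set bit may belong to a colliding value.
    public Membership Contains(string value)
    {
        return bits[IndexOf(value)] ? Membership.Maybe : Membership.No;
    }
}
```

A value that was added always yields `Maybe`, while a `No` answer is always correct. A value that was never added can still yield `Maybe` if it happens to hash to a bit set by another value.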
Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
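As a rough sketch of how such an estimate can be derived: for a bit set of size m with one hash per value, the textbook estimate of the number of distinct values after X bits have been set is -m · ln(1 - X/m). The helper below implements that formula. `FuzzyCountSketch` and its method are hypothetical names, and whether the library uses exactly this formula is an assumption.

```csharp
using System;

// Hypothetical helper, not part of Lucene.Net.
public static class FuzzyCountSketch
{
    // Estimates how many distinct values were recorded, allowing for
    // several values having collided on the same bit.
    public static int EstimateUniqueValues(int setSize, int numSetBits)
    {
        if (setSize <= 0 || numSetBits < 0 || numSetBits >= setSize)
        {
            // A fully saturated set carries no usable information for the estimate.
            throw new ArgumentOutOfRangeException(nameof(numSetBits));
        }
        double saturation = (double)numSetBits / setSize;
        return (int)(-setSize * Math.Log(1.0 - saturation));
    }
}
```

For example, a set of 1024 bits with 512 bits set gives an estimate of 709 distinct values, noticeably more than the 512 bits observed, because some values are assumed to have collided.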
This class is NOT threadsafe.
Internally, a bitset is used to record values. Once a client has finished recording a stream of values, the Downsize(Single) method can be used to create a smaller set that is sized appropriately for the number of values recorded and the desired saturation level.
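One way to picture downsizing is as folding a large bit set into a smaller one. The sketch below assumes power-of-two sizes, so that a bit index can be reduced with a mask, and picks the smallest folded size whose saturation stays at or below a target. `DownsizeSketch` and its methods are hypothetical, and the actual Downsize(Single) method may choose sizes differently.

```csharp
using System;
using System.Collections;

// Hypothetical illustration only; not the Lucene.Net implementation.
public static class DownsizeSketch
{
    // Folds a power-of-two sized bit set into a smaller power-of-two size.
    // Every set bit stays set (at index & mask), so no false negatives are introduced.
    public static BitArray Fold(BitArray source, int newSize)
    {
        var folded = new BitArray(newSize);
        int mask = newSize - 1;
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i]) folded[i & mask] = true;
        }
        return folded;
    }

    // Fraction of bits that are set.
    public static double Saturation(BitArray bits)
    {
        int set = 0;
        foreach (bool b in bits)
        {
            if (b) set++;
        }
        return (double)set / bits.Length;
    }

    // Returns the smallest folded set whose saturation is at or below the target,
    // or the source itself if no smaller size qualifies.
    public static BitArray Downsize(BitArray source, double targetMaxSaturation)
    {
        for (int size = 2; size < source.Length; size *= 2)
        {
            BitArray candidate = Fold(source, size);
            if (Saturation(candidate) <= targetMaxSaturation) return candidate;
        }
        return source;
    }
}
```

A lower target saturation keeps the false positive rate down at the cost of a larger set, which is the trade-off the saturation parameter expresses.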
@lucene.experimental
Inheritance
Assembly: Lucene.Net.Codecs.dll
Syntax
public class FuzzySet : object
Fields
Name | Description |
---|---|
VERSION_CURRENT | |
VERSION_SPI | |
VERSION_START | |
Methods
Name | Description |
---|---|
AddValue(BytesRef) | Records a value in the set. The referenced bytes are hashed and then modulo n'd where n is the chosen size of the internal bitset. |
Contains(BytesRef) | The main method required for a Bloom filter which, given a value, determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false. |
CreateSetBasedOnMaxMemory(Int32) | |
CreateSetBasedOnQuality(Int32, Single) | |
Deserialize(DataInput) | |
Downsize(Single) | |
GetEstimatedNumberUniqueValuesAllowingForCollisions(Int32, Int32) | Given a |
GetEstimatedUniqueValues() | |
GetNearestSetSize(Int32) | Rounds down required |
GetNearestSetSize(Int32, Single) | Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem. |
GetSaturation() | |
HashFunctionForVersion(Int32) | |
RamBytesUsed() | |
Serialize(DataOutput) | Serializes the data set to file using the following format: |