Cardinality estimation algorithms trade accuracy for space. To illustrate this, we use three different methods to count the number of distinct words in all of Shakespeare's works. Note that our input dataset adds extra data, so the count is higher than the reference answer for this problem. The three techniques are: a Java HashSet, a Linear Probabilistic Counter, and a HyperLogLog Counter. The results are as follows:
The table shows that the HyperLogLog counter handles these words using only 512 bytes, with an error within 3%. In contrast, the HashSet has the highest counting accuracy but requires nearly 10 MB of space, so it is easy to see why cardinality estimation is useful. In many practical applications perfect accuracy is not essential, and at the scale of most web workloads a probabilistic counter saves a great deal of space.
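As a baseline, exact counting with an in-memory set can be sketched in a few lines of Python (this mirrors the Java HashSet approach above; the tokenization is a simplifying assumption, not part of the original experiment):

import re

def exact_distinct_words(text):
    # Every word goes into a set, which discards duplicates automatically.
    words = set(re.findall(r"[a-z']+", text.lower()))
    # The count is exact, but memory grows with the number of distinct words.
    return len(words)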
Furthermore, suppose we want to implement a feature that records the number of unique IPs that visit a website each day:
Set implementation:
Use a set to store the IP of each visitor: because every element in a set is unique, duplicate IPs are stored only once, and the number of unique IPs can then be obtained by calling the SCARD command.
For example, the program can use the following code to record the IP of each website visitor on December 5, 2017:
ip = get_visitor_ip()
SADD '2017.12.5::unique::ip' ip
Then use the following code to get the number of unique IPs for that day:
SCARD '2017.12.5::unique::ip'
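Putting the two commands together, here is a minimal sketch of the set implementation using the redis-py client (the server address and the get_visitor_ip helper are assumptions, not part of the original example):

import redis

r = redis.Redis(host='localhost', port=6379)

def get_visitor_ip():
    # Hypothetical helper; in a real application the IP would come from
    # the incoming request.
    return '202.189.128.186'

def record_visitor():
    # SADD adds the IP to the day's set; duplicates are stored only once.
    r.sadd('2017.12.5::unique::ip', get_visitor_ip())

def unique_ip_count():
    # SCARD returns the number of distinct members, i.e. the unique IP count.
    return r.scard('2017.12.5::unique::ip')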
Problems with the set implementation

Using a string to store each IPv4 address can take up to 15 bytes (the format is 'XXX.XXX.XXX.XXX', for example '202.189.128.186').
The following table shows the amount of memory required when a set is used to record different numbers of unique IPs:
Unique IPs per day    One day    One month    One year
1 million             15 MB      450 MB       5.4 GB
10 million            150 MB     4.5 GB       54 GB
100 million           1.5 GB     45 GB        540 GB
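These figures follow directly from the 15-byte upper bound per address; a quick check of the arithmetic in Python:

for ips_per_day in (1_000_000, 10_000_000, 100_000_000):
    bytes_per_day = ips_per_day * 15  # up to 15 bytes per IPv4 string
    print(bytes_per_day / 1e6, 'MB per day,',
          bytes_per_day * 30 / 1e9, 'GB per month,',
          bytes_per_day * 365 / 1e9, 'GB per year')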
As the number of IPs recorded in the set increases, more and more memory is consumed. Moreover, storing IPv6 addresses would require even more memory. To better solve problems such as unique IP counting, Redis added the HyperLogLog structure in version 2.8.9.
The Redis HyperLogLog data structure
Redis HyperLogLog is an algorithm for cardinality statistics. Its advantage is that even when the number or volume of input elements is very large, the space required to calculate the cardinality is always fixed and small. In Redis, each HyperLogLog key needs only 12 KB of memory to count the cardinality of nearly 2^64 different elements. This is in stark contrast to a set, which consumes more memory as elements are added. However, because HyperLogLog calculates the cardinality from the input elements without storing the elements themselves, it cannot return the input elements the way a set can.
What is the cardinality?
For example, given the data set {1, 3, 5, 7, 5, 7, 8}, its set of distinct elements is {1, 3, 5, 7, 8}, so the cardinality (the number of non-repeating elements) is 5. Cardinality estimation means quickly calculating the cardinality within an acceptable error range.
Estimate: the cardinality given by the algorithm is not exact; it may be slightly higher or lower than the actual value, but it is kept within a reasonable range.
Several commands

Add elements to a HyperLogLog
1. PFADD key element [element ...]
Add any number of elements to the specified HyperLogLog.
This command may modify the HyperLogLog to reflect the new cardinality estimate. If the cardinality estimate of the HyperLogLog changes after the command is executed, the command returns 1; otherwise it returns 0.
The complexity of the command is O(N), where N is the number of elements added.
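For example, using redis-py against the same local server assumed earlier (the key name is illustrative):

import redis

r = redis.Redis(host='localhost', port=6379)

# The first sighting of this IP changes the estimate, so PFADD returns
# a true value (1, or True depending on the client version).
print(r.pfadd('2017.12.5::unique::ip::hll', '202.189.128.186'))
# Adding the same element again leaves the estimate unchanged, returning 0.
print(r.pfadd('2017.12.5::unique::ip::hll', '202.189.128.186'))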
2. PFCOUNT key [key ...]
Returns the cardinality estimate for a given HyperLogLog.
When only one HyperLogLog is given, the command returns the cardinality estimate for the given HyperLogLog.
When multiple HyperLogLogs are given, the command first computes the union of the given HyperLogLogs to obtain a merged HyperLogLog, and then returns the cardinality estimate of this merged HyperLogLog as its result (the merged HyperLogLog is not stored and is discarded after use).
When the command acts on a single HyperLogLog, the complexity is O(1), with a very low average constant time.
When the command acts on multiple HyperLogLogs, the complexity is O(N), and the constant factor is much larger than when processing a single HyperLogLog.
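With these two commands, the unique-IP feature from earlier can be rewritten to use a fixed 12 KB per day. A minimal sketch (the key naming scheme and the multi-day union are illustrative assumptions):

import redis

r = redis.Redis(host='localhost', port=6379)

def record_visitor(date, ip):
    # One HyperLogLog per day; memory stays fixed no matter how many IPs.
    r.pfadd(f'{date}::unique::ip', ip)

def unique_ips(date):
    # Estimated number of unique IPs for a single day.
    return r.pfcount(f'{date}::unique::ip')

def unique_ips_across(dates):
    # With several keys, PFCOUNT estimates the cardinality of their union,
    # e.g. unique visitors over a whole week, without storing the merge.
    return r.pfcount(*(f'{d}::unique::ip' for d in dates))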