title: "Percentile calculation" description: "How to use histograms to efficiently calculate percentiles in Materialize." aliases:
Percentiles are a useful statistic for understanding and interpreting the distribution of your data. This pattern covers how to use histograms to efficiently calculate percentiles in Materialize.

One way to compute percentiles is to order all values and pick the value at the position corresponding to the desired percentile. However, this approach requires storing all values, so memory grows linearly with the number of tracked values.
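For reference, here is a minimal sketch of this naive approach, assuming the `input` table (a single `BIGINT` column named `value`) created in the examples below:

```mzsql
-- Naive percentile: rank all values in order and pick the first one whose
-- rank reaches the 90th-percentile position.
WITH ordered AS (
  SELECT value, row_number() OVER (ORDER BY value) AS row_num
  FROM input
)
SELECT value AS percentile90
FROM ordered
WHERE row_num >= 0.9 * (SELECT count(*) FROM input)
ORDER BY row_num
LIMIT 1;
```

Maintaining this query incrementally requires keeping every value around, which is what the histogram-based approaches below avoid.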
Instead, two more memory-efficient alternatives are histograms and High Dynamic Range (HDR) histograms:

- **Histograms** have a lower memory footprint that is linear in the number of unique values and can compute precise percentiles. However, for domains with high cardinality, calculating precise percentiles may be computationally expensive.

- **HDR histograms** further reduce the memory footprint but compute approximate percentiles. Depending on the precision needed for the percentiles, HDR histograms may be preferred for domains with a high cardinality and dynamic range of values.
Histograms summarize data sets by grouping values into ranges and counting how many elements fall into each range. From this summary, you can derive percentiles by identifying the range where the cumulative count crosses the desired percentile threshold. By grouping each distinct value into its own range, you get exact percentiles; however, this can be computationally expensive if there is a large number of distinct values. Alternatively, you can get approximate percentiles by using HDR histograms.
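As a small illustration (with values chosen only for this example), the data set `{1, 2, 2, 5}` yields the following histogram and cumulative densities:

```
 bucket | count | cumulative_count | cumulative_density
--------+-------+------------------+--------------------
      1 |     1 |                1 |               0.25
      2 |     2 |                3 |               0.75
      5 |     1 |                4 |                  1
```

The 75th percentile is `2`: the first bucket whose cumulative density reaches `0.75`.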
## Using histograms to compute exact percentiles

To use histograms to compute exact percentiles:

- First, create a histogram view that groups each distinct value into its own bucket and counts the occurrences of each distinct value.

- Then, using a cross join on the histogram view, create a distribution view that calculates the cumulative density for a bucket by dividing the cumulative count (the sum of the counts for all buckets up to and including that bucket) by the total count.
Create a table `input`:

```mzsql
CREATE TABLE input (value BIGINT);
```

Insert the values `1` to `10` into the `input` table:

```mzsql
INSERT INTO input
SELECT n FROM generate_series(1,10) AS n;
```
Create a `histogram` view to track the unique values from the `input` table and their counts:

```mzsql
CREATE VIEW histogram AS
SELECT
  value AS bucket,
  count(*) AS count_of_bucket_values
FROM input
GROUP BY value;
```
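To inspect the histogram, you can query the view directly. Since each value from `1` to `10` occurs exactly once, every bucket has a count of `1`:

```mzsql
SELECT * FROM histogram ORDER BY bucket;
```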
Create a view `distribution` to calculate the cumulative count and the cumulative density for each bucket. The cumulative density is calculated by dividing the cumulative count for a bucket (i.e., the count of all bucket values up to and including that bucket) by the total count.

```mzsql
CREATE VIEW distribution AS
SELECT
  h.bucket,
  h.count_of_bucket_values,
  sum(g.count_of_bucket_values) AS cumulative_count,
  sum(g.count_of_bucket_values) / (SELECT sum(count_of_bucket_values) FROM histogram)
    AS cumulative_density
FROM histogram g, histogram h
WHERE g.bucket <= h.bucket
GROUP BY h.bucket, h.count_of_bucket_values
ORDER BY cumulative_density;
```
{{< note >}}
The cross join produces a number of intermediate records that is quadratic in the size of the input. While the results are only linear in size, the view may take a disproportionate amount of time to produce and maintain.
{{</ note >}}
You can then query `distribution` by the `cumulative_density` field to return specific percentiles. For example, the following query returns the 90th percentile:
```mzsql
SELECT bucket AS percentile90
FROM distribution
WHERE cumulative_density >= 0.9
ORDER BY cumulative_density
LIMIT 1;
```
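If you need several percentiles at once, you can look them all up in a single query instead of issuing one query per percentile. A sketch (the percentile list here is illustrative): because `cumulative_density` grows with `bucket`, the smallest qualifying bucket is the percentile value.

```mzsql
-- Look up the 50th, 90th, and 99th percentiles in one query.
SELECT p.percentile, min(d.bucket) AS value
FROM (VALUES (0.5), (0.9), (0.99)) AS p (percentile), distribution d
WHERE d.cumulative_density >= p.percentile
GROUP BY p.percentile;
```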
## Using HDR histograms to compute approximate percentiles

HDR histograms can be used to approximate percentiles in a space-efficient manner that scales well even for large domains with many distinct values. HDR histograms reduce the precision of the tracked values and use buckets of variable width: buckets close to 0 are narrow, whereas buckets far away from 0 are wide. This works particularly well for data that exhibits a long tail of large values, e.g., latency measurements.
HDR histograms are related to the way floating-point numbers are represented. The underlying assumption is that smaller numbers require higher precision to be distinguishable (e.g., 5 ms and 6 ms are different and should end up in different buckets), whereas larger numbers can be rounded more aggressively, as their relative error becomes less relevant (e.g., 10000 ms and 10001 ms are almost the same and can reside in the same bucket).
In the example below, to reduce the number of buckets, the values are first decomposed into `significand * 2^exponent` and then, with the precision of the significand lowered, reconstructed into the respective bucket value.
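As a concrete example (using the 4 bits of precision from the steps below): the value `10001` decomposes into `1.2208... * 2^13`. Rounding the significand down to the nearest multiple of `1/16` gives `19/16 = 1.1875`, so the value is reconstructed as `1.1875 * 8192 = 9728`, which is the bucket it lands in (you can spot this bucket in the query output further below).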
- With higher precision, fewer items are kept in the same bucket, so more memory is required, but the approximate percentile becomes more precise.
- With lower precision, more items are kept in the same bucket, so less memory is required, but the approximate percentile becomes less precise.
Except for the bucket calculation, the basic ideas of using histograms to compute percentiles remain the same for HDR histograms.
{{< tip >}}
The following example assumes you have not previously created and populated the `input` table from the **Using histograms to compute exact percentiles** example. If you have already created and populated the table, skip the corresponding steps.
{{</ tip >}}
Create a table `input`:

```mzsql
CREATE TABLE input (value BIGINT);
```

Insert the values `1` to `10` into the `input` table:

```mzsql
INSERT INTO input
SELECT n FROM generate_series(1,10) AS n;
```
Create an `hdr_histogram` view. To reduce the number of buckets, the values are rounded down to the nearest multiple of `1/16`. Specifically, the values are first decomposed into `significand * 2^exponent`; then, by reducing the precision of the significand to `1/16` (4 bits), each value is reconstructed as an approximated value.
{{< tabs >}}
{{< tab "Materialize Console" >}}
```mzsql
CREATE VIEW hdr_histogram AS
WITH
  input_parts AS (
    SELECT
      CASE WHEN value = 0 THEN NULL
           ELSE trunc(log(2, abs(value)))::int
      END AS exponent,
      CASE WHEN value = 0 THEN NULL
           ELSE value / pow(2.0, trunc(log(2, abs(value)))::int)
      END AS significand
    FROM input
  ),
  buckets AS (
    -- Reduce precision by 4 bits to round the value down to the nearest
    -- multiple of 1/16.
    SELECT
      trunc(significand * pow(2.0, 4)) / pow(2.0, 4)
        * pow(2.0, exponent) AS bucket
    FROM input_parts
  )
SELECT
  COALESCE(bucket, 0) AS bucket,
  count(*) AS count_of_bucket_values
FROM buckets
GROUP BY bucket;
```
{{</ tab >}}
{{< tab "psql" >}}
```mzsql
-- precision for the representation of the significand in bits
\set precision 4

CREATE VIEW hdr_histogram AS
WITH
  input_parts AS (
    SELECT
      CASE WHEN value = 0 THEN NULL
           ELSE trunc(log(2, abs(value)))::int
      END AS exponent,
      CASE WHEN value = 0 THEN NULL
           ELSE value / pow(2.0, trunc(log(2, abs(value)))::int)
      END AS significand
    FROM input
  ),
  buckets AS (
    -- Reduce precision by :precision bits to round the value down to the
    -- nearest multiple of 1/2^:precision (1/16 for a precision of 4).
    SELECT
      trunc(significand * pow(2.0, :precision)) / pow(2.0, :precision)
        * pow(2.0, exponent) AS bucket
    FROM input_parts
  )
SELECT
  COALESCE(bucket, 0) AS bucket,
  count(*) AS count_of_bucket_values
FROM buckets
GROUP BY bucket;
```
{{</ tab >}}
{{</ tabs >}}
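To check which bucket a given value maps to, you can evaluate the same expression on a constant. For example, as a quick sanity check with 4 bits of precision and the value `10001`:

```mzsql
-- Decompose 10001, round the significand down to a multiple of 1/16,
-- and reconstruct the bucket value.
SELECT
  trunc((10001 / pow(2.0, trunc(log(2, 10001))::int)) * pow(2.0, 4)) / pow(2.0, 4)
    * pow(2.0, trunc(log(2, 10001))::int) AS bucket;
```

This returns `9728`, which matches the bucket for `10001` in the query output further below.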
Create a view `hdr_distribution` to calculate the cumulative count and the cumulative density for each bucket. The cumulative density is calculated by dividing the cumulative count for a bucket (i.e., the count of all bucket values up to and including that bucket) by the total count.
```mzsql
CREATE VIEW hdr_distribution AS
SELECT
  h.bucket,
  h.count_of_bucket_values,
  sum(g.count_of_bucket_values) AS cumulative_count,
  sum(g.count_of_bucket_values) / (SELECT sum(count_of_bucket_values) FROM hdr_histogram)
    AS cumulative_density
FROM hdr_histogram g, hdr_histogram h
WHERE g.bucket <= h.bucket
GROUP BY h.bucket, h.count_of_bucket_values;
```
You can then query `hdr_distribution` by the `cumulative_density` field to return approximate percentiles. More precisely, the query returns the lower bound for the percentile (the next larger bucket represents the upper bound). For example, the following query returns the lower bound for the 90th percentile:
```mzsql
SELECT bucket AS approximate_percentile
FROM hdr_distribution
WHERE cumulative_density >= 0.9
ORDER BY cumulative_density
LIMIT 1;
```
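To retrieve both bounds at once, a sketch using the `lead` window function (which looks up the next bucket in cumulative-density order) could look like this:

```mzsql
WITH bounds AS (
  SELECT
    bucket,
    cumulative_density,
    lead(bucket) OVER (ORDER BY cumulative_density) AS next_bucket
  FROM hdr_distribution
)
SELECT bucket AS lower_bound, next_bucket AS upper_bound
FROM bounds
WHERE cumulative_density >= 0.9
ORDER BY cumulative_density
LIMIT 1;
```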
For small numbers, `distribution` and `hdr_distribution` are identical. Even in `hdr_distribution`, all numbers from 1 to 10 are stored in their own buckets. To verify, query `hdr_distribution`:
```mzsql
SELECT * FROM hdr_distribution;
```
The query returns the following:
```
 bucket | count_of_bucket_values | cumulative_count | cumulative_density
--------+------------------------+------------------+--------------------
      1 |                      1 |                1 |                0.1
      2 |                      1 |                2 |                0.2
      3 |                      1 |                3 |                0.3
      4 |                      1 |                4 |                0.4
      5 |                      1 |                5 |                0.5
      6 |                      1 |                6 |                0.6
      7 |                      1 |                7 |                0.7
      8 |                      1 |                8 |                0.8
      9 |                      1 |                9 |                0.9
     10 |                      1 |               10 |                  1
(10 rows)
```
But if values grow larger, buckets can contain more than one value. Let's see what happens if more values are added to the `input` table:

```mzsql
INSERT INTO input SELECT n FROM generate_series(11,10001) AS n;
```
Unlike the `distribution` view (used in the histogram approach), where each bucket contains only a single value and which has 10001 rows, a single bucket in `hdr_distribution` can represent up to 512 distinct values, and the view has only 163 rows:

```mzsql
SELECT * FROM hdr_distribution ORDER BY cumulative_density;
```
The query returns the following:
```
 bucket | count_of_bucket_values | cumulative_count |            cumulative_density
--------+------------------------+------------------+-------------------------------------------
      1 |                      1 |                1 |     0.00000999990000099999000009999900001
      2 |                      1 |                2 |     0.00001999980000199998000019999800002
      3 |                      1 |                3 |     0.00002999970000299997000029999700003
      4 |                      1 |                4 |     0.00003999960000399996000039999600004
      5 |                      1 |                5 |     0.00004999950000499995000049999500005
    ...
   7424 |                    256 |             7679 |  0.767823217678232176782321767823217678232
   7680 |                    256 |             7935 |  0.793420657934206579342065793420657934207
   7936 |                    256 |             8191 |  0.819018098190180981901809819018098190181
   8192 |                    512 |             8703 |   0.87021297870212978702129787021297870213
   8704 |                    512 |             9215 |  0.921407859214078592140785921407859214079
   9216 |                    512 |             9727 |  0.972602739726027397260273972602739726027
   9728 |                    274 |            10001 |                                          1
(163 rows)
```
When querying `hdr_distribution` for the 90th percentile value:

```mzsql
SELECT bucket AS approximate_percentile
FROM hdr_distribution
WHERE cumulative_density >= 0.9
ORDER BY cumulative_density
LIMIT 1;
```
The query returns an approximate percentile of `8704` (or, more precisely, a value between `8704` and `9216`), whereas the precise percentile is `9001`:
```
 approximate_percentile
------------------------
                   8704
(1 row)
```
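For comparison, if you also maintained the exact `distribution` view from the first example over the same `input` table, you can retrieve the exact percentile with the query from that section:

```mzsql
SELECT bucket AS percentile90
FROM distribution
WHERE cumulative_density >= 0.9
ORDER BY cumulative_density
LIMIT 1;
```

With the values `1` to `10001`, this returns `9001`, since `9001 / 10001` is the first cumulative density to reach `0.9`.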
The precision of the approximation can be adapted by changing the precision in the definition of `hdr_histogram`. The higher the precision, the fewer items are kept in the same bucket, and therefore the more precise the approximate percentile becomes. The lower the precision, the more items are kept in the same bucket, and therefore the less memory is required.
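For example, re-creating `hdr_histogram` (and `hdr_distribution`) with `\set precision 6` in the psql variant above (an illustrative choice) would round significands down to the nearest multiple of `1/64` instead of `1/16`, splitting each 4-bit bucket into four narrower ones and bringing the approximate percentile closer to the exact value, at the cost of roughly four times as many buckets.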