Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Date: 2022-05-19 17:59:21
Message-ID: CAEze2Wj6AP+xU0PxXNoqC7SB0PO=B7Y45NNMdS9hc_Jz2jqjgw@mail.gmail.com
Lists: pgsql-hackers

On Mon, 16 May 2022 at 00:09, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:
>
> On 5/15/22 21:55, Matthias van de Meent wrote:
> > Note: I am not (currently) planning on implementing this rough idea,
> > just putting it up to share and document the idea, on request of Tomas
> > (cc-ed).
> >
> > The excellent pgconf.de presentation on PostgreSQL's extended
> > statistics system by Tomas Vondra [0] talked about how the current
> > default statistics assume the MCVs of columns to be fully independent,
> > i.e. values of column A do not imply any value of columns B and C, and
> > that for accurate data on correlated values the user needs to
> > manually create statistics on the combined columns (by either
> > STATISTICS or by INDEX).
> >
> > This is said to be due to limitations in our statistics collector: to
> > determine the fraction of the table that contains the value, we store
> > the N most common values with the fraction of their occurrence in the
> > table. This value is quite exact, but combining these values proves
> > difficult: there is nothing in the stored value that can confidently
> > include or exclude parts of the table from a predicate using that MCV,
> > so we can only assume that the values of two columns are independent.
> >
> > After the presentation it came to me that if we were to add an
> > estimator for the number of rows with that value to the MCV lists in
> > the form of HLL sketches (in addition to or replacing the current
> > most_common_elem_freqs fractions), we would be able to make better
> > estimates for multi-column filters by combining the HLL row
> > cardinality sketches for filters that filter on these MCVs. This would
> > remove the immediate need for manual statistics with a Cartesian
> > product of the MCVs of those columns with their occurrence fractions,
> > which significantly reduces the need for the creation of manual
> > statistics - the need that exists due to planner mis-estimates in
> > correlated columns. Custom statistics will still be required for
> > expression statistics, but column correlation estimation _within
> > MCVs_ would be much improved.
> >
> > How I imagine this would work is that for each value in the MCV, an
> > HLL is maintained that estimates the amount of distinct tuples
> > containing that value. This can be h(TID) or h(PK), or anything else
> > that would uniquely identify returned tuples. Because the keyspace of
> > all HLLs that are generated are on the same table, you can apply join
> > and intersection operations on the HLLs of the MCVs (for OR and
> > AND operations, respectively), and provide fairly accurate estimates
> > of the number of tuples that would be returned by the filter on that
> > table.
> >
> > The required size of the HLL sketches can be determined by the number
> > of tuples scanned during analyze, potentially reducing the size
> > required to store these HLL sketches from the usual 1.5kB per sketch
> > to something smaller - we'll only ever need to count nTuples distinct
> > values, so low values for default_statistics_target would allow for
> > smaller values for m in the HLL sketches, whilst still providing
> > fairly accurate result estimates.
> >
>
> I think it's an interesting idea. In principle it allows deducing the
> multi-column MCV for arbitrary combinations of columns, not determined in
> advance. We'd have the MCV with HLL instead of frequencies for columns
> A, B and C:
>
> (a1, hll(a1))
> (a2, hll(a2))
> (...)
> (aK, hll(aK))
>
>
> (b1, hll(b1))
> (b2, hll(b2))
> (...)
> (bL, hll(bL))
>
> (c1, hll(c1))
> (c2, hll(c2))
> (...)
> (cM, hll(cM))
>
> and from this we'd be able to build MCV for any combination of those
> three columns.
>
> And in some sense it might even be more efficient/accurate, because the
> MCV on (A,B,C) might have up to K*L*M items. If there are 100 items in
> each column, that'd be 1,000,000 combinations, which we can't really
> store (target is up to 10k). And even if we could, it'd be 1M
> combinations with frequencies (so ~8-16B per combination).
>
> While with the MCV/HLL, we'd have 300 items and HLL. Assuming 256-512B
> HLL would be enough, that's still way smaller than the multi-column MCV.

HLLs for statistics_target=100 could get away with 4 bits per register, but
any target above 218 needs 5 bits (nbits = ceil(log2(log2(target * 300))),
where target * 300 is the sample size), so using 4 instead of 5 bits saves
only 20% on storage.

Accuracy improves with sqrt(m) (the standard error of an HLL with m
registers is about 1.04/sqrt(m)), so while we can shrink the number of
buckets, that is only OK if we accept the corresponding loss of accuracy.
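
To put rough numbers on that, here is a back-of-the-envelope sketch (the
300 rows per statistics-target unit is what ANALYZE samples, 1.04/sqrt(m)
is the standard error from the original HLL paper; the helper names are
mine):

from math import ceil, log2, sqrt

def register_bits(stats_target, rows_per_target=300):
    # Bits per HLL register if we only ever need to count up to the
    # sample size of stats_target * 300 rows.
    return ceil(log2(log2(stats_target * rows_per_target)))

def hll_stddev(m):
    # Relative standard error of a plain HLL with m registers.
    return 1.04 / sqrt(m)

for target in (100, 218, 219, 10_000):
    print(target, register_bits(target))    # -> 4, 4, 5, 5
for m in (2048, 512, 256):
    print(m, f"{hll_stddev(m):.1%}")        # -> 2.3%, 4.6%, 6.5%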

> Even with target=10k it'd still be cheaper to store the separate MCV
> with HLL values, if I count right, and there'd be no items omitted from
> the MCV.

There are more options, though: count-min sketches have been proposed as a
replacement for MCV lists, and they give a guaranteed upper bound on the
occurrence count of any value. If, instead of a plain counter, each bucket
held an HLL of tuple identifiers, we would get occurrence estimates for all
values (though no longer hard upper bounds), and we would also be able to
do the cross-column correlation estimation described above.
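
Roughly like this, as a toy sketch of the data structure only: Python sets
stand in for the per-bucket HLLs of h(TID), the hashing is simplistic, and
the depth/width values are arbitrary rather than derived from any
statistics target.

import random
from collections import defaultdict

class CountMinDistinct:
    # Count-min-shaped sketch where each bucket holds the set of tuple ids
    # seen for values hashing into it (a stand-in for an HLL of h(TID)).
    def __init__(self, depth=4, width=256, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.width = width
        self.buckets = [defaultdict(set) for _ in range(depth)]

    def add(self, value, tid):
        for row, seed in enumerate(self.seeds):
            self.buckets[row][hash((seed, value)) % self.width].add(tid)

    def tid_sketch(self, value):
        # Count-min rule: the smallest bucket is the tightest estimate; we
        # return the bucket itself so it can be combined across columns.
        return min((self.buckets[row][hash((seed, value)) % self.width]
                    for row, seed in enumerate(self.seeds)), key=len)

# One sketch per column; rows are (tid, a, b).
rows = [(tid, tid % 7, tid % 5) for tid in range(10_000)]
sketch_a, sketch_b = CountMinDistinct(), CountMinDistinct()
for tid, a, b in rows:
    sketch_a.add(a, tid)
    sketch_b.add(b, tid)

# WHERE a = 3 AND b = 2: intersect the per-column tuple-id sketches.
estimate = len(sketch_a.tid_sketch(3) & sketch_b.tid_sketch(2))
actual = sum(1 for tid, a, b in rows if a == 3 and b == 2)
print(estimate, actual)  # with exact sets the estimate can only err high;
                         # with real HLLs that upper-bound guarantee is lost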

> > Kind regards,
> >
> > Matthias van de Meent
> >
> > PS: Several later papers correctly point out that HLL can only count
> > up to 2^32 due to the use of a hash function that outputs only 32
> > bits; which is not enough for large tables. HLL++ solves this by using
> > a hash function that outputs 64 bits, and can thus be considered a
> > better alternative which provides the same features. But, any other
> > sketch that provides an accurate (but not necessarily: perfect)
> > count-distinct of which results can be combined should be fine as
> > well.
> >
>
> I don't think the 32-bit limitation is a problem for us, because we'd
> only ever build HLL on a sample, not the whole table. And the samples
> are limited to 3M rows (with statistics target = 10k), so we're nowhere
> near the scale requiring 64-bit hashes.

I was thinking towards incremental statistics with this, where inserts and
updates would also update the statistics. Appending to an HLL is fairly
inexpensive, whereas that is not so trivial for statistics stored as
fractions of the table.
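
To illustrate why appends are cheap: inserting into an HLL is a hash plus a
max() on a single register, with nothing to evict or re-sort. A minimal
sketch of that insert path (hashing details simplified, parameters
arbitrary):

import hashlib

class MiniHLL:
    def __init__(self, p=9):
        self.p = p                            # m = 2^p = 512 registers
        self.registers = bytearray(1 << p)

    def add(self, tid):
        # Hash the tuple identifier to 64 bits.
        h = int.from_bytes(hashlib.sha1(repr(tid).encode()).digest()[:8], 'big')
        idx = h >> (64 - self.p)              # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:        # the only mutation: one max()
            self.registers[idx] = rank

hll = MiniHLL()
for blockno in range(1_000):
    for offset in range(100):
        hll.add((blockno, offset))            # e.g. h(ctid)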

> Presumably the statistics target value would determine the necessary HLL
> parameters (and size), because e.g. with 30k rows we can't possibly see
> more than 30k distinct values.
>
> One possible problem is this all works only when all the columns are
> analyzed at the same time / using the same sample. If you do this:
>
> ANALYZE t(a);
> ANALYZE t(b);
>
> then the HLL sketches for the columns would use different ctid/PK
> values, and hence can't be combined.

Correct. This could be improved with a deterministic sample (as opposed to
the current PRNG seeded with random()), where the sampled set is
pseudo-random but always the same for a given relation at a given size.

This would result in statistics that always overlap while the relation size
is constant, and that still show a correlation when the relation size
changes (with correlation = (1 - %delta) * (1 - %delta); e.g. a 10% size
change would still leave the samples 81% correlated). Only in the worst
case would the correlation be non-existent and the resulting combinations
be no better than random, effectively regressing the estimator to
P(A & B) = P(A) * P(B), which is what we currently have.

Example: pages could be sampled in order of increasing hash(PageNo ||
relid). The hash can be anything, but a reversible hash function would
probably be best, because then we don't have to sort nBlocks hashed values
for large tables: we can run the equivalent of [0], which might be cheaper
than top-n-sorting an array of relsize block hashes. The resulting sample
would be stable for any unchanging relation size and would thus
consistently select the same blocks when you analyze the table; see the
sketch below.
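
A sketch of that selection rule (the hash choice and the names are made up
for illustration, and it just sorts instead of using the reversible-hash
trick from [0]); note how the block sets for the two relation sizes overlap
almost entirely:

import hashlib

def block_hash(blockno, relid):
    # Stand-in for hash(PageNo || relid).
    data = blockno.to_bytes(4, 'big') + relid.to_bytes(4, 'big')
    return int.from_bytes(hashlib.sha1(data).digest()[:8], 'big')

def sample_blocks(nblocks, relid, n_sample):
    # Take the n_sample blocks with the smallest hash values; deterministic
    # for a given relid and relation size, so repeated ANALYZEs of an
    # unchanged table scan exactly the same blocks.
    return set(sorted(range(nblocks),
                      key=lambda b: block_hash(b, relid))[:n_sample])

relid = 16384
before = sample_blocks(10_000, relid, 1_000)  # relation at 10k blocks
after = sample_blocks(11_000, relid, 1_000)   # same relation, ~10% larger
print(len(before & after) / len(before))      # ~0.9: most blocks re-sampled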

Regards,

Matthias

PS. I was looking through papers on different types of sketches, and found
that (according to some papers) HLL has severe drawbacks when it comes to
sketch intersection estimates. The Apache DataSketches [1] project states
that it can provide accurate combined estimates through the Theta sketch
[2][3], an adaptation of the KMV sketch that does seem to provide accurate
union, intersection and 'A not B' estimates, as well as exact results for
low input row counts - another case that HLL doesn't handle well.
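
The core of the KMV/theta idea is small enough to sketch, and it shows why
intersections fall out naturally. This is a toy version under my own
simplifications (arbitrary k, naive hashing), nothing like the actual
DataSketches implementations:

import hashlib

def h01(item):
    # Hash an item to a pseudo-uniform value in [0, 1).
    raw = int.from_bytes(hashlib.sha1(repr(item).encode()).digest()[:8],
                         'big') >> 11
    return raw / 2.0**53

class ThetaSketch:
    # Toy KMV/theta sketch: retain at most k of the smallest hash values
    # seen; theta is the threshold below which values are retained.
    def __init__(self, k=1024):
        self.k = k
        self.theta = 1.0
        self.samples = set()

    def add(self, item):
        x = h01(item)
        if x < self.theta:
            self.samples.add(x)
            if len(self.samples) > self.k:   # evict the largest, lower theta
                self.theta = max(self.samples)
                self.samples.discard(self.theta)

    def estimate(self):
        return len(self.samples) / self.theta   # exact while theta == 1.0

def intersect_estimate(a, b):
    # 'A AND B': keep the retained hashes both sketches agree on, below the
    # smaller theta, and divide by that theta to scale back up.
    theta = min(a.theta, b.theta)
    return len({x for x in a.samples & b.samples if x < theta}) / theta

sk_a, sk_b = ThetaSketch(), ThetaSketch()
for tid in range(100_000):
    if tid % 7 == 3:
        sk_a.add(tid)        # tuples matching the filter on column a
    if tid % 5 == 2:
        sk_b.add(tid)        # tuples matching the filter on column b
print(sk_a.estimate(), sk_b.estimate(), intersect_estimate(sk_a, sk_b))
# roughly 14286, 20000 and 2857 (the exact answers), modulo sampling error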

There is a patent covering this (US9152691, expected to expire in 2033),
and the project containing the code is licensed under Apache 2.

[0] SELECT reverse_hash(n) FROM generate_series(0, MaxBlockNumber, 1) t(n)
WHERE reverse_hash(n) < tablesize LIMIT n_sample_blocks
[1]
https://datasketches.apache.org/docs/Architecture/SketchFeaturesMatrix.html
[2]
https://raw.githubusercontent.com/apache/datasketches-website/master/docs/pdf/ThetaSketchFramework.pdf
[3]
https://raw.githubusercontent.com/apache/datasketches-website/master/docs/pdf/ThetaSketchEquations.pdf
