Re: PoC: Using Count-Min Sketch for join cardinality estimation

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC: Using Count-Min Sketch for join cardinality estimation
Date: 2021-06-16 23:31:55
Message-ID: CAFBsxsE3wQ4J5t-UBPQK06=PVfvowq3dFoNCod-cE9M3yo8n4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 16, 2021 at 12:23 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> The attached patch is a very simple (and perhaps naive) implementation
> adding count-min sketch to pg_statistic for all attributes with a hash
> function (as a new statistics slot kind), and considering it in
> equijoinsel_inner. There's a GUC use_count_min_sketch to make it easier
> to see how it works.

Cool! I have some high level questions below.

> So it's about 4x over-estimated, while without the count-min sketch it's
> about 2x under-estimated:

> The nice thing on count-min sketch is that there are pretty clear
> boundaries for error:
>
> size(t1,t2) <= dot_product(s1,2) <= epsilon * size(t1) * size(t2)
>
> where s1/s2 are sketches on t1/t2, and epsilon is relative error. User
> may pick epsilon, and that determines size of the necessary sketch as
> 2/epsilon. So with 128 buckets, the relative error is ~1.6%.
>
> The trouble here is that this is relative to cartesian product of the
> two relations. So with two relations, each 30k rows, the error is up to
> ~14.5M. Which is not great. We can pick lower epsilon value, but that
> increases the sketch size.

+ * depth 8 and width 128 is sufficient for relative error ~1.5% with a
+ * probability of approximately 99.6%

Okay, so in the example above, we have a 99.6% probability of having less
than 14.5M, but the actual error is much smaller. Do we know how tight the
error bounds are with some lower probability?

> There's a bunch of other open questions:
>
> 1) The papers about count-min sketch seem to be written for streaming
> use cases, which implies all the inserted data pass through the sketch.
> This patch only builds the sketch on analyze sample, which makes it less
> reliable. I doubt we want to do something different (e.g. because it'd
> require handling deletes, etc.).

We currently determine the sample size from the number of histogram buckets
requested, which is from the guc we expose. If these sketches are more
designed for the whole stream, do we have any idea how big a sample we need
to be reasonably accurate with them?

> 2) The patch considers the sketch before MCVs, simply because it makes
> it much simpler to enable/disable the sketch, and compare it to MCVs.
> That's probably not what should be done - if we have MCVs, we should
> prefer using that, simply because it determines the frequencies more
> accurately than the sketch. And only use the sketch as a fallback, when
> we don't have MCVs on both sides of the join, instead of just assuming
> uniform distribution and relying on ndistinct.

> Anyway, count-min sketches would be a better way to estimate the part
> not covered by MCVs - we might even assume the uniform distribution for
> individual counters, because that's what we do without MCVs anyway.

When we calculate the sketch, would it make sense to exclude the MCVs that
we found? And use both sources for the estimate?

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-06-16 23:40:18 Re: A qsort template
Previous Message Andres Freund 2021-06-16 23:17:27 Re: Outdated replication protocol error?