Quick Links

Re: PoC: Using Count-Min Sketch for join cardinality estimation

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Using Count-Min Sketch for join cardinality estimation
Date:	2021-06-18 20:24:45
Message-ID:	bf65f065-1eb9-b8ef-5623-32539139b1d8@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 6/18/21 9:54 PM, John Naylor wrote:
>
> On Fri, Jun 18, 2021 at 3:43 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>>
> wrote:
>
> > Sorry, I'm not sure what you mean by "we set the number of MCVs to the
> > number of histograms" :-(
> >
> > When you say "MCV limit" you mean that we limit the number of items to
> > statistics target, right? I agree plan time is one concern - but it's
> > also about analyze, as we need larger sample to build a larger MCV or
> > histogram (as the paper you referenced shows).
>
> Ah, I didn't realize the theoretical limit applied to the MCVs too, but
> that makes sense since they're basically singleton histogram buckets.
>

Something like that, yes. Looking at MCV items as singleton histogram
buckets is interesting, although I'm not sure that was the reasoning
when calculating the MCV size. AFAIK it was kinda the other way around,
i.e. the sample size is derived from the histogram paper, and when
building the MCV we ask what's sufficiently different from the average
frequency, based on the sample size etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: PoC: Using Count-Min Sketch for join cardinality estimation at 2021-06-18 19:54:40 from John Naylor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2021-06-18 20:31:43	Re: pg_stats and range statistics
Previous Message	Jeff Davis	2021-06-18 19:55:17	Re: A few nuances about specifying the timeline with START_REPLICATION