Quick Links

Re: POC: GROUP BY optimization

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: POC: GROUP BY optimization
Date:	2018-06-06 22:22:48
Message-ID:	CAGTBQpZQyZrj7mfcNdcSSJ64pZ8UtfhOn89MswJUM4fcW4bS5Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Jun 6, 2018 at 7:18 PM Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> > > Comparison cost can be approximated probabilistically as:
> > >
> > > cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
> > >
> > > Where cost_op_n is the cost of the comparison function for column N,
> > > and ndistinct(col_1_to_n) is an estimation of the number of distinct
> > > values for columns prior to N in the sort order.
> > >
> > > You can approximate ndistinct as the product of ndistinct of previous
> > > columns, or use extended statistics when available.
> > >
> >
> > Sure. The trouble is this also assumes uniform distribution, which is
> > not always the case.
>
> Well, (1.0 / ndistinct) = p(left == right).
>
> If you can compute a better p(left == right) with an MCV, you can get
> a better estimate. If you have an MCV. But that'd be a bit expensive,
> and I'm not sure it's all that relevant.
>
> To what degree does improving this produce better plans? As long as
> average expected cost is relatively well estimated (as in, one sort
> order relative to another sort order), it won't affect the decision.
>
> But if needed, the MCV can be used for this.
>
> So, in essence, you need to account for:
>
> - restrictions on that column that constrain the domain
> - distribution skew (MCV, nulls)
> - ndistinct
>
> To compute p(left == right).
>
> Maybe something as simple as the following?
>
> p_special = sum(mcv_i * mcv_i) + null_frac * null_frac
> p_other = (1 - p_special) * (1 - p_special) / ndistinct(restr)

Well, that came out with a slew of math errors, but the idea is clear:
compute p(left == right) given the available statistics and
constrained by any applicable restrictions.

In response to

Re: POC: GROUP BY optimization at 2018-06-06 22:18:25 from Claudio Freire

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	MauMau	2018-06-06 22:38:57	Re: I'd like to discuss scaleout at PGCon
Previous Message	Claudio Freire	2018-06-06 22:18:25	Re: POC: GROUP BY optimization