Re: PATCH: adaptive ndistinct estimator v4

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-01 01:20:15
Message-ID: 5542D4CF.5060700@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Hi,

On 04/30/15 22:57, Robert Haas wrote:
> On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>
> So, I took a look at this today. It's interesting work, but it looks
> more like a research project than something we can commit to 9.5. As
> far as I can see, this still computes the estimate the way we do
> today, but then also computes the estimate using this new method.
> The estimate computed the new way isn't stored anywhere, so this
> doesn't really change behavior, except for printing out a WARNING
> comparing the values produced by the two estimators.

I agree that this is not ready for 9.5 - it was meant as an experiment
(hence printing the estimate in a WARNING, to make it easier to compare
the value to the current estimator). Without that it'd be much more
complicated to compare the old/new estimates, but you're right this is
not suitable for commit.

So far it received only reviews from Jeff Janes (thanks!), but I think a
change like this really requires a more thorough review, including the
math part. I don't expect that to happen at the very end of the last CF
before the freeze.

> IMHO, the comments in this patch are pretty inscrutable. I believe
> this is because they presume more knowledge of what the patch is doing
> than I myself possess. For example:
>
> + * The AEE estimator is based on solving this equality (for "m")
> + *
> + * m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
> + *
> + * where A, B are effectively constants (not depending on m), and A(m)
> + * and B(m) are functions. This is equal to solving
> + *
> + * 0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
>
> Perhaps I am just a dummy, but I have no idea what any of that means.
> I think that needs to be fixed so that someone who knows what
> n_distinct is but knows nothing about the details of these estimators
> can get an idea of how they are doing their thing without too much
> effort. I think a lot of the comments share this problem.

Well, I don't think you're dummy, but this requires reading the paper
describing the estimator. Explaining that fully would essentially mean
copying a large portion of the paper in the comment, and I suppose
that's not a good idea. The explanation might be perhaps a bit more
detailed, though - not sure what's the right balance.

> Aside from the problems mentioned above, there's the broader question
> of how to evaluate the quality of the estimates produced by this
> estimator vs. what we're doing right now. I see that Jeff Janes has
> pointed out some cases that seem to regress with this patch; those
> presumably need some investigation, or at least some comment. And I
> think some testing from other people would be good too, before we
> commit to this.

Yeah, evaluating is difficult. I think that Jeff's approach - i.e.
testing the estimator on real-world data sets - is the right approach
here. Testing on synthetic data sets has it's value too (if only to
better understand the failure cases).

> Leaving that aside, at some point, you'll say, "OK, there may be some
> regressions left but overall I believe this is going to be a win in
> most cases". It's going to be really hard for anyone, no matter how
> smart, to figure out through code review whether that is true. So
> committing this figures to be extremely frightening. It's just not
> going to be reasonably possible to know what percentage of users are
> going to be more happy after this change and what percentage are
> going to be less happy.

For every pair of estimators you can find cases where one of them is
better than the other one. It's pretty much impossible to find an
estimator that beats all other estimators on all possible inputs.

There's no way to make this an improvement for everyone - it will
produce worse estimates in some cases, and we have to admit that. If we
think this is unacceptable, we're effectively stuck with the current
estimator forever.

> Therefore, I think that:
>
> 1. This should be committed near the beginning of a release cycle,
> not near the end. That way, if there are problem cases, we'll have a
> year or so of developer test to shake them out.

+1

> 2. There should be a compatibility GUC to restore the old behavior.
> The new behavior should be the default, because if we're not
> confident that the new behavior will be better for most people, we
> have no business installing it in the first place (plus few people
> will try it). But just in case it turns out to suck for some people,
> we should provide an escape hatch, at least for a few releases.

I think a "compatibility GUC" is a damn poor solution, IMNSHO.

For example, GUCs are database-wide, but I do expect the estimator to
perform worse only on a few data sets / columns. So making this a
column-level settings would be more appropriate, I think.

But it might work during the development cycle, as it would make
comparing the estimators possible (just run the tests with the GUC set
differently). Assuming we'll re-evaluate it at the end, and remove the
GUC if possible.

>
> 3. There should be some clear documentation in the comments indicating
> why we believe that this is a whole lot better than what we do today.
> Maybe this has been discussed adequately on the thread and maybe it
> hasn't, but whoever goes to look at the committed code should not have
> to go root through hackers threads to understand why we replaced the
> existing estimator. It should be right there in the code. If,
> hypothetically speaking, I were to commit this, and if, again strictly
> hypothetically, another distinguished committer were to write back a
> year or two later, "clearly Robert was an idiot to commit this because
> it's no better than what we had before" then I want to be able to say
> "clearly, you have not what got committed very carefully, because the
> comment for function <blat> clearly explains that this new technology
> is teh awesome".

I certainly can add such comment to the patch ;-) Choose a function.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-05-01 01:24:14 Re: PATCH: adaptive ndistinct estimator v4
Previous Message Noah Misch 2015-05-01 01:17:52 Re: Final Patch for GROUPING SETS

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2015-05-01 01:24:14 Re: PATCH: adaptive ndistinct estimator v4
Previous Message Robert Haas 2015-04-30 22:18:43 Re: PATCH: adaptive ndistinct estimator v4