Re: PATCH: adaptive ndistinct estimator v4

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-13 21:07:47
Message-ID: CAMkU=1yR=i5F1RnmJsB9C-iEbTE3J1sR+4mRbcPy6mwCotPxfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

On Thu, Apr 30, 2015 at 6:20 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> On 04/30/15 22:57, Robert Haas wrote:
>
>> On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>>>
>>
>> So, I took a look at this today. It's interesting work, but it looks
>> more like a research project than something we can commit to 9.5. As
>> far as I can see, this still computes the estimate the way we do
>> today, but then also computes the estimate using this new method.
>> The estimate computed the new way isn't stored anywhere, so this
>> doesn't really change behavior, except for printing out a WARNING
>> comparing the values produced by the two estimators.
>>
>
> I agree that this is not ready for 9.5 - it was meant as an experiment
> (hence printing the estimate in a WARNING, to make it easier to compare
> the value to the current estimator). Without that it'd be much more
> complicated to compare the old/new estimates, but you're right this is
> not suitable for commit.
>

With the warning it is very hard to correlate the discrepancy you do see
with which column is causing it, as the warnings don't include table or
column names (Assuming of course that you run it on a substantial
database--if you just run it on a few toy cases then the warning works
well).

If we want to have an explicitly experimental patch which we want people
with interesting real-world databases to report back on, what kind of patch
would it have to be to encourage that to happen? Or are we never going to
get such feedback no matter how friendly we make it? Another problem is
that you really need to have the gold standard to compare them to, and
getting that is expensive (which is why we resort to sampling in the first
place). I don't think there is much to be done on that front other than
bite the bullet and just do it--perhaps only for the tables which have
discrepancies.

Some of the regressions I've seen are at least partly a bug:

+ /* find the 'm' value minimizing the difference */
+ for (m = 1; m <= total_rows; m += step)
+ {
+ double q = k / (sample_rows * m);

sample_rows and m are both integers, and their product overflows
vigorously. A simple cast to double before the multiplication fixes the
first example I produced. The estimate goes from 137,177 to 1,108,076.
The reality is 1,062,223.

Perhaps m should be just be declared a double, as it is frequently used in
double arithmetic.

> Leaving that aside, at some point, you'll say, "OK, there may be some
>> regressions left but overall I believe this is going to be a win in
>> most cases". It's going to be really hard for anyone, no matter how
>> smart, to figure out through code review whether that is true. So
>> committing this figures to be extremely frightening. It's just not
>> going to be reasonably possible to know what percentage of users are
>> going to be more happy after this change and what percentage are
>> going to be less happy.
>>
>
> For every pair of estimators you can find cases where one of them is
> better than the other one. It's pretty much impossible to find an estimator
> that beats all other estimators on all possible inputs.
>
> There's no way to make this an improvement for everyone - it will produce
> worse estimates in some cases, and we have to admit that. If we think this
> is unacceptable, we're effectively stuck with the current estimator forever.
>
> Therefore, I think that:
>>
>> 1. This should be committed near the beginning of a release cycle,
>> not near the end. That way, if there are problem cases, we'll have a
>> year or so of developer test to shake them out.
>>
>
It can't hurt, but how effective will it be? Will developers know or care
whether ndistinct happened to get better or worse while they are working on
other things? I would think that problems will be found by focused
testing, or during beta, and probably not by accidental discovery during
the development cycle. It can't hurt, but I don't know how much it will
help.

> 2. There should be a compatibility GUC to restore the old behavior.
>> The new behavior should be the default, because if we're not
>> confident that the new behavior will be better for most people, we
>> have no business installing it in the first place (plus few people
>> will try it). But just in case it turns out to suck for some people,
>> we should provide an escape hatch, at least for a few releases.
>>
>
> I think a "compatibility GUC" is a damn poor solution, IMNSHO.
>

> For example, GUCs are database-wide, but I do expect the estimator to
> perform worse only on a few data sets / columns. So making this a
> column-level settings would be more appropriate, I think.
>
> But it might work during the development cycle, as it would make comparing
> the estimators possible (just run the tests with the GUC set differently).
> Assuming we'll re-evaluate it at the end, and remove the GUC if possible.

I agree with the "experimental GUC". That way if hackers do happen to see
something suspicious, they can just turn it off and see what difference it
makes. If they have to reverse out a patch from 6 months ago in an area of
the code they aren't particularly interested in and then recompile their
code and then juggle two different sets of binaries, they will likely just
shrug it off without investigation.

Cheers,

Jeff

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Seçkin Alan 2015-05-13 21:11:15 pgAdmin4 Bug fix or my Fault ?
Previous Message Andres Freund 2015-05-13 20:51:15 Re: Final Patch for GROUPING SETS

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Haas 2015-05-15 18:30:02 Re: PATCH: adaptive ndistinct estimator v4
Previous Message Muthusamy, Sivaraman 2015-05-11 09:55:06 How to clean/truncate / VACUUM FULL pg_largeobject without (much) downtime?