Re: distinct estimate of a hard-coded VALUES list

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: distinct estimate of a hard-coded VALUES list
Date: 2016-08-28 21:57:52
Message-ID: CAMkU=1w_sLaUHrBMmhCv3fHXYmO0yv+nRzQ58KN_A+XSx=7gRQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 22, 2016 at 10:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> >> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> It does know it, what it doesn't know is how many duplicates there are.
> >
> >> Does it know whether the count comes from a parsed query-string list/array,
> >> rather than being an estimate from something else? If it came from a join,
> >> I can see why it would be dangerous to assume they are mostly distinct.
> >> But if someone throws 6000 things into a query string and only 200 distinct
> >> values among them, they have no one to blame but themselves when it makes
> >> bad choices off of that.
> >
> > I am not exactly sold on this assumption that applications have
> > de-duplicated the contents of a VALUES or IN list. They haven't been
> > asked to do that in the past, so why do you think they are doing it?
>

I think most people wouldn't need to go out of their way to de-duplicate,
simply because the way they obtain the list means it is inherently
deduplicated in the first place, or at least not routinely full of massive
redundancy. If I thought a list was likely to have massive redundancy, I
would add code to deduplicate it just because I don't want to lug redundant
data around inside my own app-server code, regardless of what it does to
PostgreSQL's planner.

We can be sure that someone somewhere is sending huge lists with only a
couple hundred distinct values among them, but I don't think we should let
that block the change.

I think the more serious source of regressions will be cases where the
current bad distinct estimate is cancelling some other bad estimate
elsewhere in the planner, yielding an efficient plan by accident.
Improving the estimate could push the plan over to a bad choice, leading to
a large regression upon upgrading. For that reason, I think it is better
not to try to get your patch (which I tested, and which looks good to me,
except that it still isn't picking the best plan for unrelated reasons)
into 9.6 after beta4, but just to put it into 10.0. I don't know when most
people with stringent SLAs start validating the performance of new
versions, but I suspect some are already doing so before beta4, and it
wouldn't be nice to change this out from under them.

>
> It's hard to know, but my intuition is that most people would
> deduplicate. I mean, nobody is going to want their query generator
> to send X IN (1, 1, <repeat a zillion more times>) to the server if it
> could have just sent X IN (1).
>

Yes, I agree with that. But note that this isn't currently relevant to X
IN (1) anyway: 'X IN (list)' gets planned as if it were "X=ANY(ARRAY...)",
which goes through different machinery than "X=ANY(VALUES...)".
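
For concreteness, a minimal sketch of the two spellings (the table and
column names, pgbench_accounts and aid, are just assumptions for
illustration):

-- 'X IN (list)' of constants is parsed into a ScalarArrayOpExpr,
-- i.e. aid = ANY ('{1,2,3}'::integer[]):
SELECT abalance FROM pgbench_accounts WHERE aid IN (1, 2, 3);

-- The VALUES spelling is instead planned as a join against the VALUES list:
SELECT abalance FROM pgbench_accounts WHERE aid = ANY (VALUES (1), (2), (3));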

There is a nine-year-old to-do item which proposes (if I understand it
correctly) to change this, so that "X=ANY(ARRAY...)" goes through the same
planning as "X=ANY(VALUES...)". A problem with doing that is that people
can currently re-write their queries between the two ways of expressing
them, in lieu of planner hints, in case one way gives a better plan. If
they were always planned identically, it would take away that safety hatch.
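
For example, today one can EXPLAIN the same predicate written both ways and
keep whichever plan wins (again, names are assumed for illustration):

EXPLAIN SELECT abalance FROM pgbench_accounts
  WHERE aid = ANY ('{1, 2, 3}'::integer[]);    -- array machinery

EXPLAIN SELECT abalance FROM pgbench_accounts
  WHERE aid = ANY (VALUES (1), (2), (3));      -- join machinery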

The most conservative thing, given the current state, would be to fix
"X=ANY(ARRAY...)" so that it builds a hash table out of the ARRAY, and then
probes that table for each X, rather than iterating over the ARRAY for each
X, which is what it currently does. That way you would get the benefit of
using a hash, without the risk of tipping over into some unsuitable join
order in the process. It should have no user-facing downsides, just the
ugliness of having two ways to implement fundamentally the same thing.

So basically this line of a plan:

Filter: (aid = ANY ('{...}'))

Currently glosses over a nested loop over the array. We could make it
gloss over a hash probe into the "array" instead.
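
To make that concrete, a hedged example (assumed pgbench-style names) of a
query that produces such a Filter line:

-- Today: the array is rescanned linearly for every row.
-- Under the proposal: the executor would build a hash table over the
-- array elements once at startup and probe it per row, without changing
-- the shape of the plan.
SELECT count(*) FROM pgbench_accounts
  WHERE aid = ANY ('{1, 5, 9, 2, 7}'::integer[]);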

Cheers,

Jeff
