Re: disfavoring unparameterized nested loops

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: disfavoring unparameterized nested loops
Date: 2021-06-21 16:42:20
Message-ID: 0788797e-d281-c626-feda-b8946ede5d68@enterprisedb.com
Lists: pgsql-hackers

On 6/21/21 5:55 PM, Tom Lane wrote:
> Peter Geoghegan <pg(at)bowt(dot)ie> writes:
>> The heuristic that has the optimizer flat out avoid an
>> unparameterized nested loop join is justified by the belief that
>> that's fundamentally reckless. Even though we all agree on that much,
>> I don't know when it stops being reckless and starts being "too risky
>> for me, but not fundamentally reckless". I think that that's worth
>> living with, but it isn't very satisfying.
>
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.
>
> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join. So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.
>

Agreed.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows. The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.
>
I agree having such a measure would be helpful, but do you have an idea
how it could be implemented?

I've been thinking about this a bit recently and searching for papers
discussing it, but it's not clear to me how to calculate the risk (and
propagate it through the plan) without making the whole cost evaluation
way more complicated / expensive :-(

The "traditional approach" to quantifying risk would be confidence
intervals, i.e. for each cardinality estimate "e" we'd determine a range
[a,b] so that P(a <= e <= b) = p. So we could pick "p" depending on how
"robust" the plan choice should be (say 0.9 for "risky" and 0.999 for
"safe" plans) and calculate a,b. Or maybe we could calculate where the
plan changes, and then we'd see if those "break points" are within the
confidence interval. If not, great - we can assume the plan is stable,
otherwise we'd need to consider the other plans too, somehow.
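
To make that a bit more concrete, here's a minimal sketch (in Python,
nothing like actual planner code) of what such a check might look like.
It assumes the estimation error behaves like a lognormal distribution
with a known spread "sigma"; the error model, the break point and all
the numbers are invented purely for illustration:

# Rough sketch, not planner code: model the actual/estimated row count
# ratio as lognormal with spread "sigma" (a made-up error model).
import math
from statistics import NormalDist

def cardinality_interval(estimate, sigma, p):
    # Two-sided interval [a, b] that should contain the actual row
    # count with probability p, under the lognormal assumption above.
    z = NormalDist().inv_cdf((1 + p) / 2)
    return (estimate * math.exp(-z * sigma),
            estimate * math.exp(z * sigma))

# Hypothetical numbers: estimate of 10 rows, fairly noisy estimate, and
# a break point of 1000 rows where a hash join would become cheaper.
est, sigma, break_point = 10, 1.5, 1000

for p in (0.9, 0.999):
    a, b = cardinality_interval(est, sigma, p)
    verdict = ("plan looks stable" if b < break_point
               else "consider other plans")
    print("p=%.3f  interval=[%.1f, %.1f]  %s" % (p, a, b, verdict))

With these made-up numbers the 0.9 interval stays below the break point
but the 0.999 interval crosses it, i.e. the "risky" setting would keep
the nested loop while the "safe" one would force us to look at the
other plans.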

But what I'm not sure about is:

1) Now we're dealing with three cardinality estimates (the original "e"
and the boundaries "a, "b"). So which one do we use to calculate cost
and pass to upper parts of the plan?

2) The outer relation may be a complex join, so we'd need to combine the
confidence intervals for the two input relations, somehow (a rough
sketch of what that combination might look like follows after this
list).

3) We'd need to know how to calculate confidence intervals for most
plan nodes, and I'm not sure we know how to do that. That seems rather
problematic, because we'd have to propagate and combine those intervals
through the whole plan.
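
To illustrate the combination problem from (2), here's an equally
trivial sketch of naive interval propagation through a single join
node, again with entirely made-up inputs and selectivity interval:

# Naive interval propagation through one join node (hypothetical
# numbers); both inputs and the join selectivity carry an interval.

def join_interval(outer, inner, selectivity):
    # Multiply the lower bounds together and the upper bounds together,
    # the way a join node would have to combine its inputs' intervals.
    (o_lo, o_hi), (i_lo, i_hi), (s_lo, s_hi) = outer, inner, selectivity
    return (o_lo * i_lo * s_lo, o_hi * i_hi * s_hi)

# Two scans, each estimated at 1000 rows with a [500, 2000] interval,
# joined with a selectivity somewhere in [0.0005, 0.002].
print(join_interval((500, 2000), (500, 2000), (0.0005, 0.002)))
# -> (125.0, 8000.0), vs. a point estimate of 1000 * 1000 * 0.001 = 1000

Even with just a 4x range on each input, the join interval already
spans a factor of 64, which suggests that a naive combination like this
would produce intervals too wide to be useful after a few join levels,
unless we do something smarter.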

But maybe you have a much simpler approach in mind that addresses this
sufficiently well?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
