Re: Parameterized-path cost comparisons need some work

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 18:12:08
Message-ID: CA+Tgmob8OGcdA1vYzFJP-SWoiyN16rE-i6wUVSALTSF=XQQw5g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 28, 2012 at 5:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The most obvious thing to do about this is to consider that one path can
> dominate another on cost only if it is both cheaper *and* produces no
> more rows.

Hmm. Are you sure that's the right rule? I am having trouble
constructing an example, but I feel like there might be cases where
it's possible to have path A, not parameterized, path B, parameterized
by R, and path C, parameterized by S, and maybe also path D,
parameterized by both R and S. In that case, I feel like paths B and
C are incomparable. Even if one is cheaper than the other and also
produces fewer rows, they're not the same rows; so the expense of a
subsequent join might be different with one vs. the other.
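For concreteness, the dominance rule Tom proposes could be sketched roughly like this (a minimal illustration with hypothetical, simplified structs — not the actual add_path() machinery, which also has to weigh pathkeys, startup cost, and so on):

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for a planner Path. */
typedef struct SimplePath
{
    double      total_cost;     /* estimated cost to produce all rows */
    double      rows;           /* estimated output row count */
} SimplePath;

/*
 * Under the proposed rule, path a dominates path b on cost only if it
 * is both cheaper *and* produces no more rows.  A path that is cheaper
 * but produces more rows is incomparable, not dominant.
 */
static bool
path_dominates(const SimplePath *a, const SimplePath *b)
{
    return a->total_cost < b->total_cost && a->rows <= b->rows;
}
```

Note that even this rule silently assumes the two paths produce comparable row sets, which is exactly what fails for differently-parameterized paths.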

Even in the simple case where there's only one possible
parameterization, it seems very difficult to compare the cost of a
parameterized path A with the cost of an unparameterized path B. No
matter how much cheaper A is than B, the eventual nested loop join
might be so inefficient as to completely wipe A out. On the flip
side, even if A is more expensive than B, it's nearly always going to
produce at least somewhat fewer rows. There are degenerate cases
where that might not be true, like a parameterized join where the
table we're index-scanning has only one value in the parameterized
column, but presumably that's rare. So it may be worth keeping A
around just because the subsequent join could turn out to be much
cheaper with the smaller row count. Thus it seems unlikely that we'll
be able to conclude that either A or B is categorically superior.

If you accept that reasoning, then it seems like there's little if any
point in ever comparing a parameterized path to a non-parameterized
path, or in comparing two differently-parameterized paths against each
other. You basically need a separate bucket for each group of paths,
where they compete against each other but not against
differently-parameterized paths; and then after you form the nested
loop, all the paths that are now parameterized the same way (but
weren't before) can finally go head-to-head against each other.
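The bucketing idea might look something like this (again hypothetical: the parameterization is identified here by a bitmask of the rels supplying parameters, with 0 meaning unparameterized):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: each path carries a bitmask identifying which
 * outer rels it is parameterized by.  Paths only compete within their
 * own "bucket"; an unparameterized path (param_rels == 0) never
 * competes against a parameterized one, and paths parameterized by R
 * never compete against paths parameterized by S.
 */
typedef struct BucketedPath
{
    uint32_t    param_rels;     /* bitmask of parameterizing rels */
    double      total_cost;
    double      rows;
} BucketedPath;

static bool
paths_compete(const BucketedPath *a, const BucketedPath *b)
{
    return a->param_rels == b->param_rels;
}
```

Once a nested loop absorbs the parameterization, the resulting paths all land in the same bucket (typically param_rels == 0) and can finally be compared head-to-head.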

I almost wonder if the bottom-up approach is the wrong way to tackle
this entirely. Suppose we're trying to build paths for {A B}. We
first build unparameterized paths for {A} and {B} and compute join
paths for {A B}. Now we know the cheapest way to build {A B} without
using parameterization, so we can make some deductions about a plan of
the form:

Nested Loop
-> A
-> B (parameterized by A)

In particular, if we take the best cost either overall or for a path
with the pathkeys that will be produced by this join, and divide by
the number of rows for A, a parameterized path for B only makes sense
if the total cost is less than that value. Now we have a meaningful
bound that we can use to limit consideration of paths for B: anything
that's more expensive than that bound should be chucked. If B is just
a single rel, that's not that interesting; but if B is a joinrel, then
the bound applies individually to any subset of its members. So we
start replanning B as a separate join problem, but any path for any
baserel or joinrel that exceeds our cutoff cost gets chucked; and if
any rel ends up with no workable paths, then we just forget the whole
thing.
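The arithmetic behind that cutoff can be written down directly (a sketch of the reasoning above, not real planner code; it assumes rows_of_A > 0 and ignores A's own cost and the join overhead, which would only tighten the bound):

```c
#include <stdbool.h>

/*
 * The inner side of the nested loop is rescanned once per outer row,
 * so the whole plan costs at least rows_of_A * (per-scan cost of B).
 * For the parameterized plan to beat the best unparameterized plan
 * for {A B}, each scan of B must therefore cost less than
 * best_unparam_cost / rows_of_A.
 */
static double
param_cost_bound(double best_unparam_cost, double rows_of_A)
{
    return best_unparam_cost / rows_of_A;
}

/* A parameterized path for B survives only if it beats the bound. */
static bool
keep_param_path(double param_total_cost,
                double best_unparam_cost, double rows_of_A)
{
    return param_total_cost < param_cost_bound(best_unparam_cost,
                                               rows_of_A);
}
```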

Of course, the problem with this approach (aside from complexity) is
that you might end up planning the same subproblem multiple times with
different cost bounds. You could cache the results of previous
computations so that you only replan it when the cost bound goes up,
but that's still pretty awful. So maybe this is unworkable. But the
current approach seems problematic, too, because you don't really know
enough to throw very much away at the time that you need to make that
decision.
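The replan-only-when-the-bound-goes-up caching could be sketched like so (hypothetical structure; the key observation is that a result planned under a given bound remains valid for any smaller bound, since a smaller bound only discards paths that were kept):

```c
#include <stdbool.h>

/*
 * Hypothetical memo entry for a subproblem.  Planning under bound b
 * chucks every path costing more than b, so a cached result is valid
 * for any requested bound <= the bound it was planned under; only a
 * strictly larger bound can resurrect previously-chucked paths and
 * force a replan.
 */
typedef struct PlanCacheEntry
{
    bool        has_result;     /* has this subproblem been planned? */
    double      planned_bound;  /* bound in effect at last planning */
} PlanCacheEntry;

static bool
needs_replan(const PlanCacheEntry *e, double requested_bound)
{
    return !e->has_result || requested_bound > e->planned_bound;
}
```

Even with this, the worst case still replans each subproblem once per distinct increasing bound, which is the "pretty awful" part.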

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
