Re: The science of optimization in practical terms?

From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 15:13:44
Message-ID: 20090218151344.GQ32672@frubble.xen.chris-lamb.co.uk
Lists: pgsql-hackers

On Wed, Feb 18, 2009 at 01:34:25AM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > I'm interested to know whether anyone else shares my belief that
> > nested loops are the cause of most really bad plans. What usually
> > happens to me is that the planner develops some unwarranted optimism
> > about the number of rows likely to be generated by the outer side of
> > the join and decides that it's not worth sorting the inner side or
> > building a hash table or using an index, and that the right thing to
> > do is just rescan the inner node on every pass. When the outer side
> > returns three or four orders of magnitude more results than expected,
> > ka-pow!
>
> And then there is the other half of the world, who complain because it
> *didn't* pick a nestloop for some query that would have run in much less
> time if it had.

There's something called "interval arithmetic" I heard about recently
that may help here. The basic idea is to store two values, representing
the upper and lower bounds of a number, instead of a single point value.
That way you know the true value is somewhere in between, and round-off
becomes a non-issue because you know when it's happened (e.g. the FPU
is set up to always round the lower bound down and the upper bound up).
Round-off isn't a problem here, but it's one of the algebra's nice
properties.
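
To make the arithmetic concrete, something like this toy C (names and
numbers invented, nothing to do with the planner code) multiplies two
bounded row estimates, flipping the rounding mode for each bound so the
true product can't fall outside the result:

  /* Toy interval type for row estimates; purely illustrative. */
  #include <fenv.h>
  #include <stdio.h>

  typedef struct { double lo, hi; } Interval;

  /* Multiply two non-negative intervals, rounding the lower bound down
   * and the upper bound up so the true product is always inside. */
  static Interval
  interval_mul(Interval a, Interval b)
  {
      Interval r;

      fesetround(FE_DOWNWARD);
      r.lo = a.lo * b.lo;
      fesetround(FE_UPWARD);
      r.hi = a.hi * b.hi;
      fesetround(FE_TONEAREST);
      return r;
  }

  int
  main(void)
  {
      Interval outer = {1.0, 10000.0};    /* badly characterized stats */
      Interval inner = {5.0, 8.0};        /* well characterized stats */
      Interval rows = interval_mul(outer, inner);

      printf("join output somewhere in [%g, %g] rows\n", rows.lo, rows.hi);
      return 0;
  }

Addition, max and so on compose the same way, so whole cost formulas
could be evaluated over intervals.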

If the planning were done with some sort of interval then you'd be
able to encode information about how well your stats characterized the
underlying data. Traditionally awkward things like the amount of cache
would serve to drop the lower bound, but not alter the upper. The
planner would then automatically propagate this uncertainty through its
calculations, i.e. a nested loop with a tight estimate of a small number
of rows joined to a table with a wider estimate of a small number of
rows would keep its low lower bound, but its upper bound would tend to
make the planner stay away.
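
A very rough sketch of how that propagation might look, with invented
costs that bear no relation to the real cost model:

  /* Invented per-plan costs, just to show how intervals would
   * propagate; nothing like the real costing code. */
  #include <stdio.h>

  typedef struct { double lo, hi; } Interval;

  int
  main(void)
  {
      /* Outer side: stats only pin the row count to a wide range. */
      Interval outer_rows = {10.0, 100000.0};

      double inner_rescan_cost = 25.0;  /* nestloop: rescan inner per outer row */
      double hash_build_cost = 5000.0;  /* hash join: build the table once... */
      double hash_probe_cost = 0.05;    /* ...then probe it per outer row */

      Interval nestloop = {outer_rows.lo * inner_rescan_cost,
                           outer_rows.hi * inner_rescan_cost};
      Interval hashjoin = {hash_build_cost + outer_rows.lo * hash_probe_cost,
                           hash_build_cost + outer_rows.hi * hash_probe_cost};

      /* Prints nestloop [250, 2.5e+06] vs hashjoin [5000.5, 10000]: the
       * nestloop's lower bound looks great, but its upper bound is the
       * ka-pow case, so comparing on upper bounds would steer away. */
      printf("nestloop [%g, %g]  hashjoin [%g, %g]\n",
             nestloop.lo, nestloop.hi, hashjoin.lo, hashjoin.hi);
      return 0;
  }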

That said, I can't decide if that would help in any meaningful way! A
good counter-argument seems to be: why not just always go with the
upper bound? There are extensions to vanilla interval arithmetic that allow
distributions to be modeled instead of just an unknown interval. You'd
then be able to use some sort of 95% confidence interval instead of a
horribly pessimistic worst case.
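
For instance, if the row estimate were carried around as a handful of
samples rather than a single pair of bounds, the planner could cost
against the 95th percentile rather than the absolute worst case; a
crude, entirely made-up sketch:

  /* Entirely made-up sketch: carry the outer row estimate as samples
   * and cost against the 95th percentile instead of the worst case. */
  #include <stdio.h>
  #include <stdlib.h>

  static int
  cmp_double(const void *a, const void *b)
  {
      double x = *(const double *) a;
      double y = *(const double *) b;

      return (x > y) - (x < y);
  }

  int
  main(void)
  {
      /* Sampled guesses at the outer side's row count, heavy tailed. */
      double outer_rows[] = {80, 95, 110, 120, 150, 180, 200, 240, 300,
                             380, 450, 550, 700, 900, 1200, 1800, 3000,
                             5000, 20000, 90000};
      int n = sizeof(outer_rows) / sizeof(outer_rows[0]);
      double inner_rescan_cost = 25.0;
      double cost[sizeof(outer_rows) / sizeof(outer_rows[0])];
      int i, p95;

      for (i = 0; i < n; i++)
          cost[i] = outer_rows[i] * inner_rescan_cost;

      qsort(cost, n, sizeof(double), cmp_double);

      /* Crude 95th percentile: index 0.95*(n-1), no interpolation. */
      p95 = (int) (0.95 * (n - 1));
      printf("p95 cost %g vs worst case %g\n", cost[p95], cost[n - 1]);
      return 0;
  }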

--
Sam http://samason.me.uk/
