Re: An Idea for planner hints

From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-23 15:42:10
Message-ID: 44EC7752.20807@markdilger.com
Lists: pgsql-hackers

Jim C. Nasby wrote:
> On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
>> I proposed something like this quite a bit up-thread. I was hoping we
>> could have a mode in which the system would run the second, third, fourth,
>> ... best plans rather than just the best looking one, and then determine
>> from actual runtime statistics which was best. (The proposal also included
>> the ability to output the best plan and read that in at a later time in
>> lieu of a SQL query, but that part of it can be ignored if you like.) The
>> posting didn't generate much response, so I'm not sure what people thought
>> of it. The only major problem I see is getting the planner to keep track
>> of alternate plans. I don't know the internals of it very well, but I
>> think the genetic query optimizer doesn't have a concept of "runner-up #1",
>> "runner-up #2", etc., which it would need to have.
>
> I think the biggest issue is that you'd have to account for varying load
> on the box. If we assume that the database is the only thing running on
> the box, we might be able to do that by looking at things like how much
> IO traffic we generated (though of course OS caching will screw with
> that).
>
> Actually, that's another issue... any plans run after the first one will
> show up as being artificially fast, since there will be a lot of extra
> cached data.

Yes, caching issues prevent you from using wall-clock time. Instead we could
instrument the code to count the actual number of rows produced by each internal
join versus the number the planner predicted, and generate new cost estimates
from that.
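
Just to illustrate what those numbers look like with an existing tool (the
tables and join columns below are made up, and this is the already-available
EXPLAIN ANALYZE rather than the proposed instrumentation): each join node in
the output shows the planner's estimated row count next to the row count
actually observed at run time.

    -- Hypothetical schema: tables a, b, c, d with invented join columns.
    EXPLAIN ANALYZE
    SELECT *
      FROM a
      JOIN b ON b.a_id = a.id
      JOIN c ON c.b_id = b.id
      JOIN d ON d.a_id = a.id;
    -- Every join node reports "rows=<estimated>" from the planner and
    -- "actual ... rows=<observed>" from execution; the instrumentation
    -- described above would feed that discrepancy back into future cost
    -- estimates.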

Perhaps you can check my reasoning for me: I'm imagining a query which computes
AxBxCxD, where A, B, C, and D are actual tables. I'm also imagining that the
planner always chooses AxB first, then joins on C, then joins on D, because the
single-table statistics suggest this as the best course of action. It might be
that AxD is a really small intermediate result, much smaller than would be
estimated from the statistics for A taken independently of the statistics for D,
while AxB is pretty much the size you would expect given the independent
statistics for A and B. We need some way for the system to stumble upon that
fact. If we only ever calculate cross-join statistics for the plans the system
actually chooses, we will only discover that AxB is about the size we expected
it to be. So, as long as the actual size of AxB stays close to its estimate, the
system will keep choosing the same plan for future queries, totally ignorant of
the advantage of doing AxD first.
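
To make the scenario concrete, here are two of the possible join orders written
out explicitly. The schema and the row counts are invented purely for
illustration, and the written order is only meant to show the two shapes (the
planner normally reorders joins on its own):

    -- The shape the planner keeps choosing: (A join B) first.
    SELECT *
      FROM (a JOIN b ON b.a_id = a.id)
      JOIN c ON c.b_id = b.id
      JOIN d ON d.a_id = a.id;

    -- The shape it never tries: (A join D) first.
    SELECT *
      FROM (a JOIN d ON d.a_id = a.id)
      JOIN b ON b.a_id = a.id
      JOIN c ON c.b_id = b.id;

    -- Hypothetical sizes: |A| = |B| = |D| = 1,000,000.  From per-table
    -- statistics alone, A join D is estimated to be about as large as
    -- A join B; but if the a.id values present in d.a_id are a tiny,
    -- skewed subset, the real A join D might be only a few hundred rows.
    -- Nothing measured while running the first plan can reveal that.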

That last paragraph is my reasoning for suggesting that the system have a mode
in which it runs the "runner-up #1", "runner-up #2", etc. plans. Such a mode
could force it down alternate paths where it might pick up interesting
statistics that it would not find otherwise.

This idea could also be scaled back somewhat. Rather than running the other
plans in full, we could just extract the alternate joins they contain and
consider calculating statistics for those joins as well.
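
As a rough sketch of what calculating one of those extra join statistics could
mean, even by hand today (same made-up schema as above): take a two-way join
that appears only in a runner-up plan and compare the planner's estimate for it
against its observed cardinality.

    -- Planner's row estimate for the A-D join (shown on the join node):
    EXPLAIN SELECT * FROM a JOIN d ON d.a_id = a.id;
    -- Observed cardinality of the same join:
    SELECT count(*) FROM a JOIN d ON d.a_id = a.id;
    -- A large gap between the two is exactly the cross-join statistic
    -- that the per-table statistics can never expose.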

mark
