Re: Optimizer questions

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Optimizer questions
Date: 2016-03-09 01:30:13
Message-ID: 27048.1457487013@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> writes:
> On 03/08/2016 07:01 AM, Tom Lane wrote:
>> I've not spent a lot of time on this, but I think maybe what would make
>> sense is to consider both the case where function calculations are
>> postponed to after ORDER BY and the case where they aren't, and generate
>> Paths for both. Neither approach is a slam-dunk win. For example,
>> suppose that one of the tlist columns is md5(wide_column) --- it will
>> likely not be preferable to pass the wide column data through the sort
>> step rather than reducing it to a hash first. This would require some
>> work in grouping_planner to track two possible pathtargets, and work in
>> create_ordered_paths to generate paths for both possibilities. A possible
>> objection is that this would add planning work even when no real benefit
>> is possible; so maybe we should only consider the new way if the tlist has
>> significant eval cost? Not sure about that. There is also something
>> to be said for the idea that we should try to guarantee consistent
>> semantics when the tlist contains volatile functions.

> Attached please find rebased patch.

This may be rebased, but it doesn't seem to respond to any of my concerns
above. In particular, if we're going to change behavior in this area,
I think we need to try to ensure that volatile functions in the tlist will
see consistent behavior no matter which plan shape is chosen. People may
well be depending on the existing behavior for particular queries. If
we're going to break their queries, I'd at least like to be able to say
that there's now a predictable, well-defined semantics. Something like
"volatile functions in the tlist that are not in sort/group columns are
certain to be evaluated after sorting occurs, if there is an ORDER BY".
This should not be dependent on whether there's a LIMIT, nor GROUP BY
nor windowing functions, nor on random other whims of the optimizer.
So I'm not at all happy with a patch that changes things only for the
LIMIT-but-no-grouping-or-windows case.

I'm not sure whether it's worth pursuing cost-based comparisons for
functions that aren't volatile. In an ideal world we could deal with the
md5() example I gave, but as things stand I suspect we usually have too
little idea whether the result of f(x) is wider or narrower than x itself.
(One thing we do know is that f's output won't be toasted, whereas x might
be; so it might be a bad idea to bet on function results getting
narrower.) So I'm afraid it's going to be really hard to tell whether
we're making the sort step itself more or less expensive if we postpone
function evals. But we do know for certain that we're adding a projection
step that wasn't there before when we postpone functions to after the
Sort. That kind of suggests that there should be some minimum estimated
cost of the functions before we add that extra step (unless they are
volatile of course). Or only do it if there is a LIMIT or we have a
tuple_fraction suggesting that partial eval of the query is of interest.

BTW, there's some additional refactoring I had had in mind to do in
grouping_planner to make its handling of the targetlist a bit more
organized; in particular, I'd like to see it using PathTarget
representation more consistently throughout the post-scan-join steps.
I had figured that was just neatnik-ism and could be done later,
but we may need to do it now in order to be able to deal with these
considerations in a cleaner fashion. I really don't like the way
that this patch hacks up the behavior of make_scanjoin_target() without
even a comment explaining its new API.

The rough sketch of what I'd had in mind is that we should convert the
processed, final tlist into PathTarget form immediately after
query_planner, since we know we're going to need that eventually anyway.
Then, if we need to do grouping/aggregation or windowing, derive modified
PathTargets that we want to use for the inputs of those steps. (This'd
require some infrastructure that doesn't currently exist for manipulating
PathTargets, particularly the ability to copy a PathTarget and/or add a
column to an existing PathTarget.)

The way this optimization would fit into that structure is that the
DISTINCT/ORDER BY steps would now also have a desired input pathtarget
that might be different from the final target.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-03-09 01:39:18 Re: Performance improvement for joins where outer side is unique
Previous Message Amit Langote 2016-03-09 01:29:38 Re: [PROPOSAL] VACUUM Progress Checker.