Re: enable_incremental_sort changes query behavior

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jaime Casanova <jaime(dot)casanova(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: enable_incremental_sort changes query behavior
Date: 2020-11-25 19:53:52
Message-ID: CAAaqYe8thUdoYMPFa9QhENZSKhV-AcZG9MeSEUGT5m46b7cNNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 24, 2020 at 2:31 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> James Coleman <jtc331(at)gmail(dot)com> writes:
> > On Mon, Nov 23, 2020 at 2:24 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> 1. James was wondering, far upthread, why we would do projections
> >> pre-sort or post-sort. I think the answers can be found by studying
> >> planner.c's make_sort_input_target(), which separates out what we want
> >> to do pre-sort and post-sort.
>
> > Does it imply we need to intentionally avoid SRFs also?
>
> It's sort of a wart that we allow SRFs in ORDER BY at all, but my
> expectation is that make_sort_input_target would prevent lower levels
> of the planner from needing to think about that. We don't allow SRFs
> in index expressions, nor in WHERE clauses (so they'll never come up
> as mergejoin sort keys). So really there's no way that scan/join
> processing should need to consider such cases. Such a sort would
> only ever be implemented via a final Sort atop ProjectSet.

In this case though we're sorting based on "interesting" pathkeys,
which means we don't don't necessarily have index expressions or
mergejoin sort keys. For example, even with the bugfix patch (from the
parallel fix thread I've linked to previously) applied, I'm able to
generate this (with of course some GUCs "tweaked"):

select unique1 from tenk1 order by unnest('{1}'::int[]);

Gather Merge
Workers Planned: 2
-> Sort
Sort Key: (unnest('{1}'::integer[]))
-> ProjectSet
-> Parallel Index Only Scan using tenk1_unique1 on tenk1

If I understand correctly, that's a violation of the spirit of what
the comments above make_sort_input_target(). If that's true, then it
should be pretty easy to disallow them from being considered. Given
the existing restrictions on where SRFs can be placed in a SELECT
(e.g., no CASE/COALESCE) and my assumption (without having thoroughly
verified this) that SRFs aren't allowed as arguments to functions or
as arguments to any other expression (I assume only scalars are
allowed), would it be sufficient to check the pathkey expression
(without recursion) to see if it's a FuncExpr that returns a set?

> >> 2. Robert is correct that under normal circumstances, the targetlists of
> >> both baserels and join rels contain only Vars.
>
> > Is that primarily a project policy? Or a limitation of our current
> > planner (just can't push things down/carry the results back up as much
> > as we'd like)? Or something else? In particular, it seems plausible
> > there are cases where pushing down the sort on a non-Var expression to
> > the base rel could improve plans, so I'm wondering if there's a reason
> > to intentionally avoid that in the long or short run (or both).
>
> I think you've just rediscovered Joe Hellerstein's thesis topic [1].
> We ripped out the remnants of that code ages ago (search very early
> git states for "JMH" if you're interested), because the complexity
> vs. benefit ratio seemed pretty bad. Maybe there'll be a case for
> putting it back some day, but I'm dubious. Note that we have the
> ability to push down sorts-on-expressions anyway; that's not constrained
> by what is in the relation targetlists.

I'm doing some grepping now to see what that work entailed, but I'm
not sure that what I'm describing is the same thing. By "pushing down
a non-Var expression to the base rel" I mean what (to my ears) sounds
like what you're saying by "we have the ability to push down
sorts-on-expressions anyway; that's not constrained by what is in the
relation targetlists". The target list entries I'm talking about for
these expressions are the ones inserted by
prepare_sort_from_pathkeys(), which in PG13 happens just a bit more
frequently since, at least in parallel query, consider sort paths
lower than we did before based on interesting pathkeys (for now just
query_pathkeys) in generate_useful_gather_paths().

> > Interesting: so merge joins are an example of us pushing down sorts,
> > which I assume means (part of) the answer to my question on (2) is
> > that there's nothing inherently wrong/broken with evaluating
> > expressions lower down the tree as long as we're careful about what is
> > safe/unsafe with respect to volatility and parallelism?
>
> Right, I don't see any fundamental problem with that, we just have
> to be careful about these constraints.

Great. That's exactly what I was looking for. To summarize: the
constraints are on volatility, parallel safety, and SRFs.

> > I have wondered if we should strictly require the expression to be in
> > the target list even if nonvolatile, but prepare_sort_from_pathkeys
> > doesn't seem to think that's necessary, so I'm comfortable with that
> > unless there's something you know we haven't considered.
>
> No, prepare_sort_from_pathkeys is happy to build a sort expression if it
> can't find it already computed in the input. The secret here is that
> we should never get to that code with a "dangerous" sort expression,
> because we should never have made a Path that would request such a thing
> in the first place. It's fairly obvious to me how we deal with the
> consideration for volatile sortkeys. We cannot restrict parallel-unsafe
> sortkeys quite the same way, because the restriction only applies to
> parallel not non-parallel plans. Maybe it's sufficient to mark Paths
> as parallel unsafe if they sort by parallel-unsafe sortkeys? Is there
> anyplace that is Just Assuming that paths it builds will be
> parallel-safe?

It seems actually quite simple as I understand it: in my proposed
patch in [1] we know in find_em_expr_usable_for_sorting_rel() whether
or not we're working on parallel paths, and so we're able to rely
simply on is_parallel_safe() for the expr under consideration.

James

1: https://www.postgresql.org/message-id/flat/CAAaqYe8cK3g5CfLC4w7bs%3DhC0mSksZC%3DH5M8LSchj5e5OxpTAg%40mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-11-25 20:00:32 Re: proposal: possibility to read dumped table's name from file
Previous Message Pavel Stehule 2020-11-25 19:35:30 Re: SEARCH and CYCLE clauses