Re: enable_incremental_sort changes query behavior

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Jaime Casanova <jaime(dot)casanova(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: enable_incremental_sort changes query behavior
Date: 2020-10-03 21:44:40
Message-ID: 20201003214440.fe4674dce5q7z4bg@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 03, 2020 at 10:50:06AM -0400, James Coleman wrote:
>On Fri, Oct 2, 2020 at 11:16 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>> On Fri, Oct 2, 2020 at 7:07 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>> >
>> > On Fri, Oct 2, 2020 at 6:28 PM Tomas Vondra
>> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> > >
>> > > On Fri, Oct 02, 2020 at 05:45:52PM -0400, James Coleman wrote:
>> > > >On Fri, Oct 2, 2020 at 4:56 PM Tomas Vondra
>> > > ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> > > >>
>> > > >> ...
>> > > >>
>> > > >> More importanly, it does not actually fix the issue - it does fix that
>> > > >> particular query, but just replacing the DISTINCT with either ORDER BY
>> > > >> or GROUP BY make it fail again :-(
>> > > >>
>> > > >> Attached is a simple script I used, demonstrating these issues for the
>> > > >> three cases that expect to have ressortgroupref != 0 (per the comment
>> > > >> before TargetEntry in plannodes.h).
>> > > >
>> > > >So does checking for volatile expressions (if you happened to test
>> > > >that) solve all the cases? If you haven't tested that yet, I can try
>> > > >to do that this evening.
>> > > >
>> > >
>> > > Yes, it does fix all the three queries in the SQL script.
>> > >
>> > > The question however is whether this is the root issue, and whether it's
>> > > the right way to fix it. For example - volatility is not the only reason
>> > > that may block qual pushdown. If you look at qual_is_pushdown_safe, it
>> > > also blocks pushdown of leaky functions in security_barrier views. So I
>> > > wonder if that could cause failures too, somehow. But I haven't managed
>> > > to create such example.
>> >
>> > I was about to say that the issue here is slightly different from qual
>> > etc. pushdown, since we're not concerned about quals here, and so I
>> > wonder where we determine what target list entries to put in a given
>> > scan path, but then I realized that implies (maybe!) a simpler
>> > solution. Instead of duplicating checks on target list entries would
>> > be safe, why not check directly in get_useful_pathkeys_for_relation()
>> > whether or not the pathkey has a target list entry?
>> >
>> > I haven't been able to try that out yet, and so maybe I'm missing
>> > something, but I need to step out for a bit, so I'll have to look at
>> > it later.
>> I've started poking at this, but haven't yet found a workable
>> solution. See the attached patch which "solves" the problem by
>> breaking putting the sort under the gather merge, but it at least
>> demonstrates conceptually what I think we're interested in doing.
>> The issue currently is that the comparison of expressions fails -- in
>> our query, the first column selected shows up as a Var (with the same
>> contents) but is a different pointer in the em_expr and the reltarget
>> exprs list. I don't yet see a good way to get the equivalence class
>> for a PathTarget entry.
>Hmm, I think I was looking to do is the attached patch. I didn't
>realize until I did a lot of reading through source that we have an
>equal() function that can compare exprs. That (plus the realization in
>[1] the originally reported backtrace wasn't where the error actually
>came from) convinced me that what we need is to confirm not only that
>the all of the vars in the ec member come from the relids in the rel
>but also that the expr is actually represented in the target of the
>With the gucs changed as I mentioned earlier both of the plans (with
>and without a volatile call in the 2nd select entry) now look like:
> Unique
> -> Gather Merge
> Workers Planned: 2
> -> Sort
> Sort Key: ref_0.i, (md5(ref_0.t))
> -> Nested Loop
> -> Parallel Index Scan using ref_0_i_idx on ref_0
> -> Function Scan on generate_series ref_1
>Without the gucs changed the minimal repro case now doesn't error, but
>results in this plan:
> HashAggregate
> Group Key: ref_0.i, CASE WHEN pg_rotate_logfile_old() THEN ref_0.t
>ELSE ref_0.t END
> -> Nested Loop
> -> Seq Scan on ref_0
> -> Function Scan on generate_series ref_1
>Similarly in your six queries I now only see parallel query showing up
>in the last one.

OK, that seems reasonable I think.

>I created an entirely new function because adding the target expr
>lookup to the existing find_em_expr_for_rel() function broke a bunch
>of postgres_fdw tests. That maybe raises questions about whether that
>code also could have problems in theory/in the future, but I didn't
>investigate further. In any case we already know it excludes
> maybe it's fine because in practice that's actually a
>broader exclusion than what we're doing here.

I don't think postgres_fdw needs these new checks, because FDW scan's
are not allowed in parallel part of the plan - if that changes in the
future, I'm sure that'll require fixes in plenty other places.

OTOH it's interesting that it triggers those failures - I wonder if we
could learn something from them? AFAICS those paths can't be built by
generate_useful_gather_paths (because of the postgres_fdw vs. parallel
query restrictions), so how do these plans look?

>This seems to fix the issue, but I'd like feedback on whether it's too
>strict. We could of course just check em_has_volatile, but I'm
>wondering if that's simultaneously too strict (by not allowing the
>volatile expression to be computed in the gather merge supposing
>there's no join) and too loose (maybe there are other cases we should
>care about?). It also just strikes me as re-encoding rules that should
>have already been applied (and thus we should be able to look up in
>the data we have if it's safe to use the expr/pathkey). Those are
>mostly intuitions though.

I don't know :-( As I mentioned before, I suspect checking just the
volatility may not be enough in some cases (leakyness, etc.) and I think
you're right it may be too strict in other cases.

Not sure I understand which rules you think we're re-encoding, but I
have a nagging feeling there's a piece of code somewhere earlier in
the query planner meant to prevent such cases (sort on path without all
the pathkeys), and that fixing it here is just a band aid :-(

I'm sure we're building plans with Sort on top of Index Scan paths, so
how come those don't fail? How come the non-parallel version of these
queries don't fail?


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-10-03 22:02:36 Re: terminate called after throwing an instance of 'std::bad_alloc'
Previous Message Rémi Lapeyre 2020-10-03 21:42:52 Re: Add header support to text format and matching feature