Re: BUG #15795: ERROR: could not find pathkey item to sort

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: suresh(dot)arsenal29(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15795: ERROR: could not find pathkey item to sort
Date: 2019-05-09 16:06:18
Message-ID: 24167.1557417978@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> writes:
> On 2019/05/08 19:33, Tom Lane wrote:
>> For the archives, though, this isn't hard to reproduce:
>>
>> regression=# create table person(_id int, _actual_type_name text, name text);
>> CREATE TABLE
>> regression=# SELECT DISTINCT A._id0 as _id0, A._actual_type_name0 as _actual_type_name0
>> FROM ( ( SELECT DISTINCT _id as _id0, _actual_type_name as
>> _actual_type_name0, name as name0 FROM person ) union all (
>> SELECT DISTINCT _id as _id0, _actual_type_name as _actual_type_name0, name
>> as name0 FROM person)) as A WHERE ( A.name0 = A.name0 );
>> ERROR: could not find pathkey item to sort
>>
>> Curiously, this only fails for me in 9.6 and 10, not earlier or later
>> branches.

> ... In any case, the main problem
> seems to be that convert_subquery_pathkeys() can't keep "name" from
> appearing in the output pathkeys that it produces. Based on that premise,
> I added the following code to convert_subquery_pathkeys():

> +
> + if (retvallen == outer_query_keys)
> + break;

That's just a kluge.

> which seems to fix the issue. Alternatively, maybe we can apply
> truncate_useless_pathkeys() to the result of convert_subquery_pathkeys().

Yeah, I thought about that too. The comment on convert_subquery_pathkeys
that says truncate_useless_pathkeys would do nothing is wrong. However,
if you apply truncate_useless_pathkeys to the results, you'll find that
some of the regression test query plans change, and not for the better.
I'm thinking of changing that comment to read like

* We intentionally don't do truncate_useless_pathkeys() here, because there
* are situations where seeing the raw ordering of the subquery is helpful.
* For example, if it returns ORDER BY x DESC, that may prompt us to
* construct a mergejoin using DESC order rather than ASC order; but the
* right_merge_direction heuristic would have us throw the knowledge away.

Anyway, yes, the basic issue is that convert_subquery_pathkeys is
returning pathkeys that are outside the norm for the outer query,
and some places are falling over because of that. I've found at
least two bugs that I believe are present in multiple branches,
although I'm having difficulty in constructing branch-independent
tests for them, because the bugs interact with each other and with
other changes that we've made.

First off, the reported problem can be reproduced with an existing
regression-test table, eg

regression=# select distinct q1 from (select distinct * from int8_tbl union all select distinct * from int8_tbl) ss where (q2 = q2);
ERROR: could not find pathkey item to sort

The reason that this doesn't fail in >= v11 is that we do not build
an EquivalenceClass for q2 in the outer query, so that there's nothing
for convert_subquery_pathkeys to match it to. And that happens because
the q2 = q2 clause is converted to "q2 is not null". In <= v10, that
clause is left alone and since it's a mergejoinable operator we end
up assigning ECs to each side. (Conceivably, since it's not actually
a join clause, we don't need to do that, but I'm hesitant to muck with
that; it would only be a band-aid over the true problem anyhow.)

Anyway, what's happening in v10 is

1. convert_subquery_pathkeys sees that the subpath is sorted by q1, q2,
and it successfully generates a 2-element pathkeys list representing
that in the outer query, which it can do because (a) q2 is in the
subpath tlist and (b) there is an EC in the outer query for q2.
The other UNION ALL arm has an identical path, too.

2. When we come to build a MergeAppend path for the UNION ALL, we
label it with the common pathkeys list of the two inputs, i.e.
q1, q2, although the outer query really only cares about sort-by-q1.

3. When we come to build a plan for the MergeAppend, we need to
locate the sort columns in the outputs of its child SubqueryScan
plans ... and q2 is not there. Ooops.

Now, how is that happening given that convert_subquery_pathkeys
is specifically checking that the column is emitted by the subquery?
The reason is that *it's checking the wrong thing*. It's looking
at what the native output of the subquery is, not at what the
SubqueryScan that we're going to stack atop it will produce.
And in this case, because q2 is not required by the outer query
(once we've pushed the WHERE clause down to or below the subquery
scan), q2 is not in the reltarget list for the subquery RTE, so
it's not going to be emitted by the SubqueryScan.

I haven't decided what to do about this. The minimally invasive
fix would be to teach convert_subquery_pathkeys that it can't emit
anything not listed in rel->reltarget. That seems like it might
lose useful information, though perhaps the consequences are minimal
given how hard it is to hit this bug. Better would be to add
entries to the reltarget for useful sort columns --- but I think
it's likely too late to do so here. (We may have already generated
some paths using the existing reltarget contents.)

Anyway, we're not out of the woods by any means. Wondering whether
the issue couldn't be reproduced in >= v11 by doing something that
wouldn't be reduced to an IS NOT NULL, I stumbled across this in HEAD:

regression=# select distinct q1 from (select distinct * from int8_tbl union all select distinct * from int8_tbl) ss where (-q1 = q2);
psql: server closed the connection unexpectedly

That's hitting this assertion:

TRAP: FailedAssertion("!(list_length(dest_tlist) == list_length(src_tlist))", File: "tlist.c", Line: 345)

Investigation shows that in this case, we successfully generate
a Plan, but the final apply_tlist_labeling step fails, because
create_merge_append_plan has blithely ignored the CP_EXACT_TLIST
flag and stuck on extra resjunk columns. That seems like a pretty
clear violation of createplan.c's internal expectations, so I made
a patch that teaches that function to stick on a filtering Result
if it added extra columns in violation of a flag demanding that it
not do so. As of HEAD, create_append_plan has similar logic so I
made the same change there, although I have not found a way to reach
that bug today. (See patch attached.)

With that patch, we successfully build a plan, but it looks a bit odd:

Unique
-> Result
-> Merge Append
Sort Key: "*SELECT* 1".q1, ((- "*SELECT* 1".q1))
-> Subquery Scan on "*SELECT* 1"
-> Unique
-> Sort
Sort Key: i81.q1, i81.q2
-> Seq Scan on int8_tbl i81
Filter: ((- q1) = q2)
-> Subquery Scan on "*SELECT* 2"
-> Unique
-> Sort
Sort Key: i82.q1, i82.q2
-> Seq Scan on int8_tbl i82
Filter: ((- q1) = q2)

What's happening there is that the outer query has an EC containing
{ -q1, q2 }, and convert_subquery_pathkeys successfully matches the q2
pathkey to that EC. Then, when prepare_sort_from_pathkeys tries to
find a sort column for that EC, it still doesn't find q2 in the
SubqueryScan output --- but it does find q1, so it can make an
expression matching the other EC entry instead of failing.

This is kind of grotty, of course. It'd be nicer if the outer
MergeAppend only considered as many sort columns as we actually
semantically need. But I'm not sure of a good way to arrange that.

Another thing that I've not quite got to the bottom of is how come the
MergeAppend path is getting stuck with CP_EXACT_TLIST responsibility in
the first place. Generally, since we know MergeAppend can't project,
there'd be a ProjectionPath on top of it. Usually there is, which is
how come this bug has escaped detection this long --- so maybe there's
another bug that we really ought to be using a ProjectionPath but are
not, in some particular circumstance? Not that I think it would be okay
for create_merge_append_plan to just ignore the flag.

Anyway, because there are so many different things that can mask these
bugs, getting good test cases for them might be hopeless. In the
attached I have

select distinct q1 from
(select distinct * from int8_tbl i81
union all
select distinct * from int8_tbl i82) ss
where q2 = q2;

which exposes the reltarget-vs-subplan-tlist issue, but only in 9.6
and 10, though it surely exists in other branches including HEAD, and

select distinct q1 from
(select distinct * from int8_tbl i81
union all
select distinct * from int8_tbl i82) ss
where -q1 = q2;

which exposes create_merge_append_plan's misfeasance, but only in
11 and HEAD, though it surely exists at least as far back as v10.
Getting more robust test cases would be nice -- any ideas?

regards, tom lane

Attachment Content-Type Size
handle-exact-tlist-for-append-plans-1.patch text/x-diff 12.9 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alan Jackson 2019-05-09 16:45:55 inconsistent results querying table partitioned by date
Previous Message Francisco Olarte 2019-05-09 07:58:32 Re: prerunscript.command.line.error