Re: Path question

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hans-Jürgen Schönig <hs(at)cybertec(dot)at>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Path question
Date: 2010-09-23 21:34:02
Message-ID: AANLkTikQZsEiWmrFTzEkv7tnn2h5GTp6A0ybU8Gf2Xfk@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2010/9/20 Robert Haas <robertmhaas(at)gmail(dot)com>:
> 4. TGL: "In any case, I'm amazed that it's not failing regression
> tests all over the place with those critical tests in
> make_sort_from_pathkeys lobotomized by random #ifdef FIXMEs.  Perhaps
> we need some more regression tests...".  Obviously, we need to remove
> that lobotomy and insert the correct fix for whatever problem it was
> trying to solve.  Adding some regression tests seems wise, too.

So, I spent a lot of time today trying to understand why these random
FIXMEs were inserted and just exactly what they break or, uh, fix. I
didn't get very far. The regression tests all pass with the latest
version of Merge Append, with and without these FIXMEs, and in fact
when I extracted just those changes and applied them to HEAD, the
regression tests still pass. So I said to myself: "Let's come up with
a regression test that illustrates why these changes are Evil and
Dangerous, and then we can add that to the regression test suite, per
Tom's suggestion." This proved to be easier said than done:
apparently our code is so good that it works even with random sections
commented out. Go us.

As a debugging aide, instead of actually diking these tests out, I
made them elog(LOG). This has the same effect as removing them
entirely but it makes it possible to see whether you've triggered the
relevant code path. Then I tried to trigger those code paths, which I
shall hereinafter refer to as FIXME #1, FIXME #2, FIXME #3, and FIXME
#4, as per the attached patch. It proved to be pretty easy to trigger
FIXME #3; indeed, just about every query I tried did the job.

CREATE TABLE parent (a integer NOT NULL, b integer NOT NULL);
CREATE TABLE child (b integer, a integer, c text NOT NULL) INHERITS (parent);
CREATE TABLE joinme (j integer NOT NULL);

I populated joinme with a single row with the value 1, and child with
a million rows with a running from 1 to a million, b running from 10
million down to 9 million and 1, and c getting random()::text ||
random()::text || random()::text || random()::text. Then you can
trigger FIXME #3 with either of these two (the second also triggers
FIXME #4):

select * from joinme, parent where a = j and j = 1;
select * from parent order by a;

I believe that the first case fires because of the "ec_has_const"
condition and the second from the "EC only has at most one member
condition". However, it's difficult to see what problem this actually
causes. From what I gather from reading the code, em_is_child
EquivalenceMembers are only supposed to affect the construction of
inner index-scan paths; and certainly if there's at most one
EquivalenceMember it's unclear to me how you'd be forming a join.
Maybe it's possible to break something in the ec_has_const case, but I
haven't been to puzzle out what that thing is.

FIXME #1 and FIXME #2 were much harder to trigger. In fact, barring
significant further lobotimization of the code, I couldn't. For FIXME
#1, the right sort of query seems to be something like this:

select 1 from joinme left join parent on a = j where j = 1 order by a;

...but the problem is that in every example I could construct, the
em_is_child EquivalenceMembers were *after* the members for the parent
rel, so you never get far enough in the loop to see them. For related
reasons, I couldn't get FIXME #2 to fire, either: in order to have a
chance of firing FIXME #2, you have to get all the way through the
loop where FIXME #1 is located without finding a match. I can't
believe all of this code is here just for fun, but in every example I
could come up with you quickly find a match in the first loop, and
never even finish visiting all the members of that list, let alone
reach the second loop. Somehow, you need to construct a case where
the values to be sorted aren't directly emitted by the node
immediately under the sort, but I couldn't figure out how to do that -
no matter how I rearranged things, the planner just computed the
necessary outputs one level down. Anyone have an idea?

I did find one apparent weirdness in the way explain outputs sort keys, namely:

rhaas=# explain (costs off) select 2 as x union all select 1 as x order by x;
QUERY PLAN
--------------------
Sort
Sort Key: (2)
-> Append
-> Result
-> Result
(5 rows)

This seems to have nothing to do with the problem at hand, but it
looks pretty weird. The planner is in fact sorting on the column, not
on the literal value 2, so this is (I think) just a display error.
"x" would be a more reasonable representation of the sort key.

All of this leaves me wondering why Greg ended up ifdefing out this
code in the first place. There's probably something I'm missing
here... but for now I can't think of a better idea than just removing
the #ifdefs and hoping that whatever problem they were causing was
limited to an earlier version of the code that no longer exists.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
planner_lobotomy.patch application/octet-stream 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-09-23 21:36:07 Re: git cherry-pick timestamping issue
Previous Message Abhijit Menon-Sen 2010-09-23 21:31:35 Re: git cherry-pick timestamping issue