Partitionwise join fails under GEQO

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Partitionwise join fails under GEQO
Date: 2020-10-05 01:17:33
Message-ID: 1683100.1601860653@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

If you run the core regression tests with geqo_threshold = 2
(injected, say, via ALTER SYSTEM SET), you will notice the
earlier tests showing some join ordering differences, which
is unsurprising ... but when it gets to partition_join, that
test just dumps core.

Investigation shows that the crash is due to a corrupt EquivalenceClass
member. It gets that way because add_child_join_rel_equivalences
is unaware of the planner's memory allocation policies, and feels
free to mess with the EC data structures during join planning.
When GEQO's transient memory context is thrown away after one round
of GEQO planning, some still-referenced EC data goes with it,
resulting in a crash during the next round.

I band-aided around that with the attached patch, which is enough
to prevent the crash, but it's entirely unsatisfactory as a production
solution. The problem is that each GEQO round will re-add the same
child EC members, since add_child_join_rel_equivalences has absolutely
no concept that the members it wants might be there already. Since
GEQO tends to use a lot of rounds, this can run to significant memory
bloat, not to mention a pretty serious speed hit while EC operations
thumb through all of those duplicate expressions.

Now that I've seen this, I wonder whether add_child_join_rel_equivalences
might not be making duplicate EC entries even without GEQO. Is there any
guarantee that it's not called repeatedly on related join-rel sets?

So I think that in addition to this, we need some check to see if the
derived EC child member is already there. It's likely that checking
for an em_relids match would be sufficient to eliminate irrelevant
members quickly.

regards, tom lane

Attachment Content-Type Size
bandaid-for-geqo-crash.patch text/x-diff 1.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-10-05 01:18:24 Re: A modest proposal: let's add PID to assertion failure messages
Previous Message Michael Paquier 2020-10-05 01:05:54 Re: Add header support to text format and matching feature