Re: Memory leak from ExecutorState context?

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Memory leak from ExecutorState context?
Date: 2023-03-10 18:51:14
Message-ID: 20230310195114.6d0c5406@karst
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> So I guess the best thing would be to go through these threads, see what
> the status is, restart the discussion and propose what to do. If you do
> that, I'm happy to rebase the patches, and maybe see if I could improve
> them in some way.

OK! It took me some time, but I did it. I'll try to sum up the situation as
simply as possible.

I reviewed the following threads:

* Out of Memory errors are frustrating as heck!
2019-04-14 -> 2019-04-28

This discussion stalled, waiting for OP, but ideas there ignited all other

* accounting for memory used for BufFile during hash joins
2019-05-04 -> 2019-09-10

This was suppose to push forward a patch discussed on previous thread, but
it actually took over it and more ideas pops from there.

* Replace hashtable growEnable flag
2019-05-15 -> 2019-05-16

This one quickly merged to the next one.

* Avoiding hash join batch explosions with extreme skew and weird stats
2019-05-16 -> 2020-09-24

Another thread discussing another facet of the problem, but eventually end up
discussing / reviewing the BNLJ implementation.

Five possible fixes/ideas were discussed all over these threads:

1. "move BufFile stuff into separate context"
last found patch: 2019-04-21

This patch helps with observability/debug by allocating the bufFiles in the
appropriate context instead of the "ExecutorState" one.

I suppose this simple one has been forgotten in the fog of all other
discussions. Also, this probably worth to be backpatched.

2. "account for size of BatchFile structure in hashJoin"
last found patch: 2019-04-22

This patch seems like a good first step:

* it definitely helps older versions where other patches discussed are way
too invasive to be backpatched
* it doesn't step on the way of other discussed patches

While looking at the discussions around this patch, I was wondering if the
planner considers the memory allocation of bufFiles. But of course, Melanie
already noticed that long before I was aware of this problem and discussion:

2019-07-10: «I do think that accounting for Buffile overhead when estimating
the size of the hashtable during ExecChooseHashTableSize() so it can be
used during planning is a worthwhile patch by itself (though I know it
is not even part of this patch).»

Tomas Vondra agreed with this in his answer, but no new version of the patch
where produced.

Finally, Melanie was pushing the idea to commit this patch no matter other
pending patches/ideas:

2019-09-05: «If Tomas or someone else has time to pick up and modify BufFile
accounting patch, committing that still seems like the nest logical

Unless I'm wrong, no one down voted this.

3. "per slice overflow file"
last found patch: 2019-05-08

This patch has been withdraw after an off-list discussion with Thomas Munro
because of a missing parallel hashJoin implementation. Plus, before any
effort started on the parallel implementation, the BNLJ idea appeared and
seemed more appealing.


By the time, it still seems to have some interest despite the BNLJ patch:

2019-07-10: «If slicing is made to work for parallel-aware hashjoin and the
code is in a committable state (and probably has the threshold I mentioned
above), then I think that this patch should go in.»

But this might have been disapproved later by Tomas:

2019-09-10: «I have to admit I kinda lost track [...] My feeling is that we
should get the BNLJ committed first, and then maybe use some of those
additional strategies as fallbacks (depending on which issues are still
unsolved by the BNLJ).»

4. "Block Nested Loop Join"
last found patch: 2020-08-31

Most of the discussion was consideration about the BNLJ parallel and
semi-join implementation. Melanie put a lot of work on this. This looks like
the most advanced patch so far and add a fair amount of complexity.

There were some open TODOs, but Melanie was waiting for some more review and
feedback on v11 first.

5. Only split the skewed batches
Discussion: 2019-07-11

Robert Haas pointed out that current implementation and discussion were not
really responding to the skew in a very effective way. He's considering
splitting batches unevenly. Hubert Zhang stepped in, detailed some more and
volunteer to work on such a patch.

No one reacted.

It seems to me this is an interesting track to explore. This looks like a
good complement of 2. ("account for size of BatchFile structure in hashJoin"
patch). However, this idea probably couldn't be backpatched.

Note that it could help with 3. as well by slicing only the remaining skewed

Also, this could have some impact on the "Block Nested Loop Join" patch if
the later is kept to deal with the remaining skewed batches.

> I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> years, maybe we shouldn't stall and get at least an imperfect stop-gap
> solution ...

Indeed. So, to sum-up:

* Patch 1 could be rebased/applied/backpatched
* Patch 2 is worth considering to backpatch
* Patch 3 seemed withdrawn in favor of BNLJ
* Patch 4 is waiting for some more review and has some TODO
* discussion 5 worth few minutes to discuss before jumping on previous topics

1 & 2 are imperfect solution but doesn't weight much and could be backpatched.
4 & 5 are long-term solutions for a futur major version needing some more
discussions, test and reviews.
3 is not 100% buried, but a last round in the arena might settle its destiny for

Hopefully this sum-up is exhaustive and will help clarify this 3-years-old


In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-03-10 18:57:55 Re: RLS makes COPY TO process child tables
Previous Message Tom Lane 2023-03-10 17:28:28 Re: pg_usleep for multisecond delays