Re: Memory leak from ExecutorState context?

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Memory leak from ExecutorState context?
Date: 2023-05-10 12:24:19
Message-ID: 20230510142419.6185a492@karst
Lists: pgsql-hackers

Thank you for your review!

On Mon, 8 May 2023 11:56:48 -0400
Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:

> ...
> > 4. accessor->read_buffer is currently allocated in accessor->context as
> > well.
> >
> > This buffer holds tuples read from the fileset. This is still a buffer,
> > but not related to any file anymore...
> >
> > Because of 3 and 4, and the coarse context granularity of the
> > SharedTuplestore-related code, I'm now wondering if "fileCxt" shouldn't
> > be renamed "bufCxt".
>
> I think bufCxt is a potentially confusing name. The context contains
> pointers to the batch files and saying those are related to buffers is
> confusing. It also sounds like it could include any kind of buffer
> related to the hashtable or hash join.
>
> Perhaps we could call this memory context the "spillCxt"--indicating it
> is for the memory required for spilling to permanent storage while
> executing hash joins.

"Spilling" seems fair and a large enough net to grab everything around temp
files and accessing them.
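
For the record, with that name the dedicated context creation could look
like this (just a sketch of the idea, not necessarily the exact v6 code):

    /* dedicated context for everything spilled to temp storage */
    hashtable->spillCxt = AllocSetContextCreate(hashtable->hashCxt,
                                                "HashSpillContext",
                                                ALLOCSET_DEFAULT_SIZES);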

> I discuss this more in my code review below.
>
> > From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
> > Date: Mon, 27 Mar 2023 15:54:39 +0200
> > Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
> > context
>
> > diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
> > index 5fd1c5553b..a4fbf29301 100644
> > --- a/src/backend/executor/nodeHash.c
> > +++ b/src/backend/executor/nodeHash.c
> > @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
> >  	if (nbatch > 1 && hashtable->parallel_state == NULL)
> >  	{
> > +		MemoryContext oldctx;
> > +
> >  		/*
> >  		 * allocate and initialize the file arrays in hashCxt (not needed for
> >  		 * parallel case which uses shared tuplestores instead of raw files)
> >  		 */
> > +		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
> > +
> >  		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
> >  		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> >  		/* The files will not be opened until needed... */
> >  		/* ... but make sure we have temp tablespaces established for them */
>
> I haven't studied it closely, but I wonder if we shouldn't switch back
> into the oldctx before calling PrepareTempTablespaces().
> PrepareTempTablespaces() is explicit about what memory context it is
> allocating in, however, I'm not sure it makes sense to call it in the
> fileCxt. If you have a reason, you should add a comment and do the same
> in ExecHashIncreaseNumBatches().

I had no particular reason. I caught it in ExecHashIncreaseNumBatches() while
reviewing my own changes, but missed it here.

Line moved in v6.
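
In v6, the relevant part of ExecHashTableCreate() now reads roughly like
this (a sketch, not the verbatim patch), with the switch back to oldctx
happening before PrepareTempTablespaces():

    oldctx = MemoryContextSwitchTo(hashtable->fileCxt);

    hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
    hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);

    MemoryContextSwitchTo(oldctx);

    /* ... but make sure we have temp tablespaces established for them */
    PrepareTempTablespaces();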

> > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> >  		   hashtable, nbatch, hashtable->spaceUsed);
> > #endif
> >
> > -	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> > -
> >  	if (hashtable->innerBatchFile == NULL)
> >  	{
> > +		MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
> > +
> >  		/* we had no file arrays before */
> >  		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
> >  		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > +
> > +		MemoryContextSwitchTo(oldcxt);
> > +
> >  		/* time to establish the temp tablespaces, too */
> >  		PrepareTempTablespaces();
> >  	}
> > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
>
> I don't see a reason to call repalloc0_array() in a different context
> than the initial palloc0_array().

Unless I'm wrong, wrapping the whole if/else block in the hashtable->fileCxt
memory context seemed useless, as repalloc() actually reallocates in the
context where the original allocation occurred. So I only wrapped the
palloc() calls.
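
To illustrate (a minimal sketch of the palloc/repalloc semantics, not the
patch itself; oldnbatch is hypothetical here):

    MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);

    /* the initial allocation determines the owning context... */
    hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);

    MemoryContextSwitchTo(oldcxt);

    /* ...while repalloc0_array() later keeps the chunk in fileCxt,
     * whatever the current context happens to be at that point */
    hashtable->innerBatchFile = repalloc0_array(hashtable->innerBatchFile,
                                                BufFile *, oldnbatch, nbatch);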

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > ...
> > @@ -1308,21 +1310,27 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
> >   * The data recorded in the file for each tuple is its hash value,
>
> It doesn't sound accurate to me to say that it should be called *in* the
> HashBatchFiles context. We now switch into that context, so you can
> probably remove this comment and add a comment above the switch into the
> filecxt which explains that the temp file buffers should be allocated in
> the filecxt (both because they need to be allocated in a sufficiently
> long-lived context and to increase visibility of their memory overhead).

Indeed. Comment moved and reworded in v6.

> > diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
> > index 8ee59d2c71..74867c3e40 100644
> > --- a/src/include/executor/hashjoin.h
> > +++ b/src/include/executor/hashjoin.h
> > @@ -25,10 +25,14 @@
> >   *
> >   * Each active hashjoin has a HashJoinTable control block, which is
> >   * palloc'd in the executor's per-query context. All other storage needed
> > - * for the hashjoin is kept in private memory contexts, two for each hashjoin.
> > + * for the hashjoin is kept in private memory contexts, three for each
> > + * hashjoin:
>
> Maybe "hash table control block". I know the phrase "control block" is
> used elsewhere in the comments, but it is a bit confusing. Also, I wish
> there was a way to make it clear this is for the hashtable but relevant
> to all batches.

I tried to reword the comment with this additional info in mind in v6. Does it
match what you had in mind?

> ...
> So, if we are going to allocate the array of pointers to the spill files
> in the fileCxt, we should revise this comment. As I mentioned above, I
> agree with you that if the SharedTupleStore-related buffers are also
> allocated in this context, perhaps it shouldn't be called the fileCxt.
>
> One idea I had is calling it the spillCxt. Almost all memory allocated in this
> context is related to needing to spill to permanent storage during execution.

Agree

> The one potential confusing part of this is batch 0 for parallel hash
> join. I would have to dig back into it again, but from a cursory look at
> ExecParallelHashJoinSetUpBatches() it seems like batch 0 also gets a
> shared tuplestore with associated accessors and files even if it is a
> single batch parallel hashjoin.
>
> Are the parallel hash join read_buffer and write_chunk also used for a
> single batch parallel hash join?

I don't think so.

For the inner side, there are various Assert() calls around the batchno == 0
special case. Plus, batch 0 always has its own branch when inserting into a
batch, writing directly to shared memory via ExecParallelHashPushTuple().
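
Condensed from ExecParallelHashTableInsert() (a simplified sketch, eliding
the allocation and retry logic around hashTuple and shared):

    if (batchno == 0)
    {
        /* batch 0: push the tuple directly into the shared, in-memory
         * hash table */
        ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
                                  hashTuple, shared);
    }
    else
    {
        /* other batches: spill the tuple to the shared tuplestore */
        sts_puttuple(hashtable->batches[batchno].inner_tuples,
                     &hashvalue, tuple);
    }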

The outer side of the join actually creates all batches using the shared
tuplestore mechanism, including batch 0, **only** if the number of batches
is greater than one. See ExecParallelHashJoinOuterGetTuple():

    /*
     * In the Parallel Hash case we only run the outer plan directly for
     * single-batch hash joins. Otherwise we have to go to batch files, even
     * for batch 0.
     */
    if (curbatch == 0 && hashtable->nbatch == 1)
    {
        slot = ExecProcNode(outerNode);

So, for a single-batch PHJ, it seems no temp files are involved.

> Though, perhaps spillCxt still makes sense? It doesn't specify
> multi-batch...

I'm not sure I see where the confusing part would be. Is it that some STS
machinery is allocated but never used? When there's only one batch, it
doesn't really matter much, I suppose, as the context's consumption stays
really low. Plus, there are other unused STS/contexts around there (i.e.
inner batch 0 and the batch context in PHJ). I'm not sure it's worth trying
to optimize this compared to the cost of the added code complexity.

Or am I off-topic and missing something obvious?

> > --- a/src/include/executor/nodeHashjoin.h
> > +++ b/src/include/executor/nodeHashjoin.h
> > @@ -29,6 +29,6 @@ extern void ExecHashJoinInitializeWorker(HashJoinState *state, ParallelWorkerContext *pwcxt);
> >
>
> I would add a comment explaining why ExecHashJoinSaveTuple() is passed
> the fileCxt as a parameter.
>
> > extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,

Isn't the comment added to the function itself in v6 enough? It seems
uncommon to comment on function parameters in header files.
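
That said, if a short note in the header is wanted anyway, it could be as
small as this (hypothetical wording, assuming the v6 prototype):

    /*
     * ExecHashJoinSaveTuple() is given the fileCxt so that the BufFile
     * buffers it may allocate live in a sufficiently long-lived context,
     * and show up in its memory context stats.
     */
    extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                                      BufFile **fileptr, MemoryContext fileCxt);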

Last, about your TODO in the 0001 patch: do you mean we should document that
after splitting batch N, its rows can only be redispatched to batches N0 or
N1?

Regards,

Attachment Content-Type Size
0001-v6-Describe-hybrid-hash-join-implementation.patch text/x-patch 2.8 KB
0002-v6-Allocate-hash-batches-related-BufFile-in-a-dedicated.patch text/x-patch 12.2 KB
