Re: POC: Sharing record typmods between backends

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Sharing record typmods between backends
Date: 2017-09-04 06:14:39
Message-ID: CAEepm=2Vs5iR4MO4LWe3Ap0jiQngoT3jrSAGPV3BiGd3i3n6ig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the review and commits so far. Here's a rebased, debugged
and pgindented version of the remaining patches. I ran pgindent with
--list-of-typedefs="SharedRecordTableKey,SharedRecordTableEntry,SharedTypmodTableEntry,SharedRecordTypmodRegistry,Session"
to fix some weirdness around these new typenames.

While rebasing the 0002 patch (removal of tqueue.c's remapping logic),
I modified the interface of the newly added
ExecParallelCreateReaders() function from commit 51daa7bd because it
no longer has any reason to take a TupleDesc.

On Fri, Aug 25, 2017 at 1:46 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2017-08-21 11:02:52 +1200, Thomas Munro wrote:
>> 2. Andres didn't like what I did to DecrTupleDescRefCount, namely
>> allowing to run when there is no ResourceOwner. I now see that this
>> is probably an indication of a different problem; even if there were a
>> worker ResourceOwner as he suggested (or perhaps a session-scoped one,
>> which a worker would reset before being reused), it wouldn't be the
>> one that was active when the TupleDesc was created. I think I have
>> failed to understand the contracts here and will think/read about it
>> some more.
>
> Maybe I'm missing something, but isn't the issue here that using
> DecrTupleDescRefCount() simply is wrong, because we're not actually
> necessarily tracking the TupleDesc via the resowner mechanism?

Yeah. Thanks.

> If you look at the code, in the case it's a previously unknown tupledesc
> it's registered with:
>
> entDesc = CreateTupleDescCopy(tupDesc);
> ...
> /* mark it as a reference-counted tupdesc */
> entDesc->tdrefcount = 1;
> ...
> RecordCacheArray[newtypmod] = entDesc;
> ...
>
> Note that there's no PinTupleDesc(), IncrTupleDescRefCount() or
> ResourceOwnerRememberTupleDesc() managing the reference from the
> array. Nor was there one before.
>
> We have other code managing TupleDesc lifetimes similarly, and look at
> how they're freeing it:
> /* Delete tupdesc if we have it */
> if (typentry->tupDesc != NULL)
> {
> /*
> * Release our refcount, and free the tupdesc if none remain.
> * (Can't use DecrTupleDescRefCount because this reference is not
> * logged in current resource owner.)
> */
> Assert(typentry->tupDesc->tdrefcount > 0);
> if (--typentry->tupDesc->tdrefcount == 0)
> FreeTupleDesc(typentry->tupDesc);
> typentry->tupDesc = NULL;
> }

Right. I have changed shared_record_typmod_registry_worker_detach()
to be more like that, with an explanation.

> This also made me think about how we're managing the lookup from the
> shared array:
>
> /*
> * Our local array can now point directly to the TupleDesc
> * in shared memory.
> */
> RecordCacheArray[typmod] = tupdesc;
>
> Uhm. Isn't that highly highly problematic? E.g. tdrefcount manipulations
> which are done by all lookups (cf. lookup_rowtype_tupdesc()) would in
> that case manipulate shared memory in a concurrency unsafe manner.

No. See this change, in that and similar code paths:

- IncrTupleDescRefCount(tupDesc);
+ PinTupleDesc(tupDesc);

The difference between IncrTupleDescRefCount() and PinTupleDesc() is
that the latter recognises non-refcounted tuple descriptors
(tdrefcount == -1) and does nothing. Shared tuple descriptors are not
reference counted (see TupleDescCopy() which initialises
dst->tdrefcount to -1). It was for foolish symmetry that I was trying
to use ReleaseTupleDesc() in shared_record_typmod_registry_detach()
before, since it also knows about non-refcounted tuple descriptors,
but that's not appropriate: it calls DecrTupleDescRefCount() which
assumes that we're using resource owners. We're not.

To summarise the object lifetime management situation created by this
patch: shared TupleDesc objects accumulate in per-session DSM memory
until eventually the session ends and the DSM memory goes away. A bit
like CacheMemoryContext: there is no retail cleanup of shared
TupleDesc objects. BUT: the DSM detach callback is used to clear out
backend-local pointers to that stuff (and any non-shared reference
counted TupleDesc objects that might be found), in anticipation of
being able to reuse a worker process one day (which will involve
attaching to a new session, so we mustn't retain any traces of the
previous session in our local state). Maybe I'm trying to be a little
too clairvoyant there...

I improved the cleanup code: now it frees RecordCacheArray and
RecordCacheHash and reinstalls NULL pointers. Also it deals with
errors in GetSessionDsmHandle() better.

I renamed the members of Session to include a "shared_" prefix, which
seems a bit clearer.

I refactored it so that it never makes needless local copies of
TupleDesc objects (previously assign_record_type_typmod() would create
an extra local copy and cache that, which was wasteful). That
actually makes much of the discussion above moot: on detach, a worker
should now ONLY find shared non-refcounted TupleDesc objects in the
local caches, so the FreeTupleDesc() case is unreachable...

The leader on the other hand can finish up with a mixture of local and
shared TupleDesc objects in its cache, if it had some before it ran a
parallel query. Its detach hook doesn't try to free those so it
doesn't matter.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
shared-record-typmods-v10.patchset.tgz application/x-gzip 22.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-09-04 06:38:32 Re: CLUSTER command progress monitor
Previous Message Michael Paquier 2017-09-04 06:14:00 Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()