Re: POC: Sharing record typmods between backends

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Sharing record typmods between backends
Date: 2017-08-12 01:55:04
Message-ID: 20170812015504.rgq7ljlwanhslrll@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2017-08-11 20:39:13 +1200, Thomas Munro wrote:
> Please find attached a new patch series. I apologise in advance for
> 0001 and note that the patchset now weighs in at ~75kB compressed.
> Here are my in-line replies to your two reviews:

Replying to a few points here, then I'll do a pass through your
submission...

> On Tue, Jul 25, 2017 at 10:09 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > It does concern me that we're growing yet another somewhat different
> > hashtable implementation. Yet I don't quite see how we could avoid
> > it. dynahash relies on proper pointers, simplehash doesn't do locking
> > (and shouldn't) and also relies on pointers, although to a much lesser
> > degree. All the open coded tables aren't a good match either. So I
> > don't quite see an alternative, but I'd love one.
>
> Yeah, I agree. To deal with data structures with different pointer
> types, locking policy, inlined hash/eq functions etc, perhaps there is
> a way we could eventually do 'policy based design' using the kind of
> macro trickery you started where we generate N different hash table
> variations but only have to maintain source code for one chaining hash
> table implementation? Or perl scripts that effectively behave as a
> cfront^H^H^H nevermind. I'm not planning to investigate that for this
> cycle.

Whaaa, what have I done!!!! But more seriously, I'm doubtful it's worth
going there.

> > + * level. However, when a resize operation begins, all partition locks must
> > + * be acquired simultaneously for a brief period. This is only expected to
> > + * happen a small number of times until a stable size is found, since growth is
> > + * geometric.
> >
> > I'm a bit doubtful that we need partitioning at this point, and that it
> > doesn't actually *degrade* performance for your typmod case.
>
> Yeah, partitioning not needed for this case, but this is supposed to
> be more generally useful. I thought about making the number of
> partitions a construction parameter, but it doesn't really hurt does
> it?

Well, using multiple locks and such certainly isn't free. A mutex in an
exclusively owned cacheline is nearly an order of magnitude faster to
acquire than one whose cacheline is currently shared, not to speak of
modified. It also increases the size overhead, which might end up
mattering for other use cases.

> > + * Resizing is done incrementally so that no individual insert operation pays
> > + * for the potentially large cost of splitting all buckets.
> >
> > I'm not sure this is a reasonable tradeoff for the use-case suggested so
> > far; it doesn't exactly make things simpler. We're not going to grow
> > much.
>
> Yeah, designed to be more generally useful. Are you saying you would
> prefer to see the DHT patch split into an initial submission that does
> the simplest thing possible, so that the unlucky guy who causes the
> hash table to grow has to do all the work of moving buckets to a
> bigger hash table? Then we could move the more complicated
> incremental growth stuff to a later patch.

Well, most of the potential use cases for dsmhash I've heard about so
far don't actually benefit much from incremental growth. In nearly all
the implementations I've seen, incremental moving ends up requiring more
total cycles than doing it all at once, and for parallelism-type use
cases the stall isn't really an issue. So yes, I think this is something
worth considering. If we were to actually use DHT for shared caches or
such, this'd be different, but that seems darned far off.

> This is complicated, and in the category that I would normally want a
> stack of heavy unit tests for. If you don't feel like making
> decisions about this now, perhaps iteration (and incremental resize?)
> could be removed, leaving only the most primitive get/put hash table
> facilities -- just enough for this purpose? Then a later patch could
> add them back, with a set of really convincing unit tests...

I'm inclined to go for that, yes.
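
To make the tradeoff concrete, here's a minimal sketch of the
all-at-once variant, assuming a chaining table whose buckets are
dsa_pointer chains. All struct and field names here (dht_control,
dht_item, etc.) are hypothetical, not DHT's actual ones; only the dsa_*
calls are the real API:

/* Caller must hold all partition locks exclusively. */
static void
resize_all_at_once(dsa_area *area, dht_control *ctl)
{
	size_t		new_nbuckets = ctl->nbuckets * 2;	/* geometric growth */
	dsa_pointer new_dp = dsa_allocate0(area,
									   new_nbuckets * sizeof(dsa_pointer));
	dsa_pointer *new_buckets = dsa_get_address(area, new_dp);
	dsa_pointer *old_buckets = dsa_get_address(area, ctl->buckets);
	size_t		i;

	/* The unlucky inserter relinks every item in one pass. */
	for (i = 0; i < ctl->nbuckets; ++i)
	{
		dsa_pointer item_dp = old_buckets[i];

		while (DsaPointerIsValid(item_dp))
		{
			dht_item   *item = dsa_get_address(area, item_dp);
			dsa_pointer next_dp = item->next;
			size_t		b = item->hash & (new_nbuckets - 1);

			item->next = new_buckets[b];
			new_buckets[b] = item_dp;
			item_dp = next_dp;
		}
	}

	dsa_free(area, ctl->buckets);
	ctl->buckets = new_dp;
	ctl->nbuckets = new_nbuckets;
}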

> > +/*
> > + * Detach from a hash table. This frees backend-local resources associated
> > + * with the hash table, but the hash table will continue to exist until it is
> > + * either explicitly destroyed (by a backend that is still attached to it), or
> > + * the area that backs it is returned to the operating system.
> > + */
> > +void
> > +dht_detach(dht_hash_table *hash_table)
> > +{
> > + /* The hash table may have been destroyed. Just free local memory. */
> > + pfree(hash_table);
> > +}
> >
> > Somewhat inclined to add debugging refcount - seems like bugs around
> > that might be annoying to find. Maybe also add an assert ensuring that
> > no locks are held?
>
> Added an assert that no locks are held.
>
> In an earlier version I had reference counts. Then I realised that it
> wasn't really helping anything. The state of being 'attached' to a
> dht_hash_table isn't really the same as holding a heavyweight resource
> like a DSM segment or a file which is backed by kernel resources.
> 'Attaching' is just something you have to do to get a backend-local
> palloc()-ated object required to interact with the hash table, and
> since it's just a bit of memory there is no strict requirement to
> detach from it, if you're happy to let MemoryContext do that for you.
> To put it in GC terms, there is no important finalizer here. Here I
> am making the same distinction that we make between stuff managed by
> resowner.c (files etc) and stuff managed by MemoryContext (memory); in
> the former case it's an elog()-gable offence not to close things
> explicitly in non-error paths, but in the latter you're free to do
> that, or pfree earlier. If in future we create more things that can
> live in DSA memory, I'd like them to be similarly free-and-easy. Make
> sense?

I don't quite follow. You're saying that just because there could be
local bugs (which'd easily be found via the mcxt/aset debugging stuff
and/or valgrind), making sure the shared resources are still there isn't
useful? I don't quite find that convincing...
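
Something as simple as this would do, I think (sketch only; the
refcount field in the shared control object is hypothetical):

	/* in dht_attach(): */
#ifdef USE_ASSERT_CHECKING
	pg_atomic_fetch_add_u32(&hash_table->control->refcount, 1);
#endif

	/* in dht_detach(): */
#ifdef USE_ASSERT_CHECKING
	Assert(!hash_table->find_locked);
	Assert(pg_atomic_fetch_sub_u32(&hash_table->control->refcount, 1) > 0);
#endif
	pfree(hash_table);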

> > +/*
> > + * Look up an entry, given a key. Returns a pointer to an entry if one can be
> > + * found with the given key. Returns NULL if the key is not found. If a
> > + * non-NULL value is returned, the entry is locked and must be released by
> > + * calling dht_release. If an error is raised before dht_release is called,
> > + * the lock will be released automatically, but the caller must take care to
> > + * ensure that the entry is not left corrupted. The lock mode is either
> > + * shared or exclusive depending on 'exclusive'.
> >
> > This API seems a bit fragile.
>
> Do you mean "... the caller must take care to ensure that the entry is
> not left corrupted"?

Yes.

> This is the same as anything protected by LWLocks including shared
> buffers. If you error out, locks are released and you had better not
> have left things in a bad state. I guess this comment is really just
> about what C++ people call "basic exception safety".

Kind of. Although it's not impossible to make this bit less error
prone, e.g. by zeroing the entry before returning.
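
I.e. roughly this in the insert path (sketch, field names hypothetical):

	/* Zero the caller-visible part of a freshly inserted entry, so that
	 * an error raised mid-update leaves something recognizably empty
	 * rather than arbitrary garbage. */
	entry = (char *) dsa_get_address(hash_table->area, item_pointer) +
		MAXALIGN(sizeof(dht_item));
	memset(entry, 0, hash_table->params.entry_size);
	return entry;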

Now that I think about it, it's possibly also worthwhile to note that
any iterators and such are invalid after errors (given that ->locked is
going to be wrong)?

> > diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
> > index 9fd7b4e019b..97c0125a4ba 100644
> > --- a/src/backend/access/common/tupdesc.c
> > +++ b/src/backend/access/common/tupdesc.c
> > @@ -337,17 +337,75 @@ DecrTupleDescRefCount(TupleDesc tupdesc)
> > {
> > Assert(tupdesc->tdrefcount > 0);
> >
> > - ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
> > + if (CurrentResourceOwner != NULL)
> > + ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
> > if (--tupdesc->tdrefcount == 0)
> > FreeTupleDesc(tupdesc);
> > }
> >
> > What's this about? CurrentResourceOwner should always be valid here, no?
> > If so, why did that change? I don't think it's good to detach this from
> > the resowner infrastructure...
>
> The reason is that I install a detach hook
> shared_record_typmod_registry_detach() in worker processes to clear
> out their typmod registry. It runs at a time when there is no
> CurrentResourceOwner. It's a theoretical concern only today, because
> workers are not reused. If a worker lingered in a waiting room and
> then attached to a new session DSM from a different leader, it would
> need to remember nothing of the previous leader's typmods.

Hm. I'm not sure I like ad hoc code like this. Wouldn't it be better to
have a 'per worker' rather than 'per transaction' resowner for things
like this? Otherwise we end up with various data structures to keep
track of things.
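
Sketch of what I mean (function and variable names hypothetical;
ResourceOwnerCreate() is the existing API):

static ResourceOwner PerWorkerResourceOwner = NULL;

/* At worker startup, create an owner that lives as long as the process. */
static void
worker_startup(void)
{
	PerWorkerResourceOwner = ResourceOwnerCreate(NULL, "per-worker");
}

/* Detach hooks then never run with CurrentResourceOwner == NULL. */
static void
run_session_detach_hooks(void)
{
	ResourceOwner oldowner = CurrentResourceOwner;

	CurrentResourceOwner = PerWorkerResourceOwner;
	/* ... invoke the registered detach callbacks here ... */
	CurrentResourceOwner = oldowner;
}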

> > /*
> > - * Magic numbers for parallel state sharing. Higher-level code should use
> > - * smaller values, leaving these very large ones for use by this module.
> > + * Magic numbers for per-context parallel state sharing. Higher-level code
> > + * should use smaller values, leaving these very large ones for use by this
> > + * module.
> > */
> > #define PARALLEL_KEY_FIXED UINT64CONST(0xFFFFFFFFFFFF0001)
> > #define PARALLEL_KEY_ERROR_QUEUE UINT64CONST(0xFFFFFFFFFFFF0002)
> > @@ -63,6 +74,16 @@
> > #define PARALLEL_KEY_ACTIVE_SNAPSHOT UINT64CONST(0xFFFFFFFFFFFF0007)
> > #define PARALLEL_KEY_TRANSACTION_STATE UINT64CONST(0xFFFFFFFFFFFF0008)
> > #define PARALLEL_KEY_ENTRYPOINT UINT64CONST(0xFFFFFFFFFFFF0009)
> > +#define PARALLEL_KEY_SESSION_DSM UINT64CONST(0xFFFFFFFFFFFF000A)
> > +
> > +/* Magic number for per-session DSM TOC. */
> > +#define PARALLEL_SESSION_MAGIC 0xabb0fbc9
> > +
> > +/*
> > + * Magic numbers for parallel state sharing in the per-session DSM area.
> > + */
> > +#define PARALLEL_KEY_SESSION_DSA UINT64CONST(0xFFFFFFFFFFFF0001)
> > +#define PARALLEL_KEY_RECORD_TYPMOD_REGISTRY UINT64CONST(0xFFFFFFFFFFFF0002)
> >
> > Not this patch's fault, but this infrastructure really isn't great. We
> > should really replace it with a shmem.h style infrastructure, using a
> > dht hashtable as backing...
>
> Well, I am trying to use the established programming style. We
> already have a per-query DSM with a TOC indexed by magic numbers (and
> executor node IDs). I add a per-session DSM with a TOC indexed by a
> different set of magic numbers. We could always come up with
> something better and fix it in both places later?

I guess so. I know Robert is a bit tired of me harping about this, but I
really don't think this is great...
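
For reference, the pattern in question is just a TOC consulted by magic
keys, roughly like this (sketch; 'address' stands for wherever the
per-session DSM segment is mapped):

	shm_toc    *toc;
	void	   *control;

	/* Attach to the per-session TOC and look an entry up by magic key. */
	toc = shm_toc_attach(PARALLEL_SESSION_MAGIC, address);
	control = shm_toc_lookup(toc, PARALLEL_KEY_SESSION_DSA, false);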

> > +/*
> > + * A flattened/serialized representation of a TupleDesc for use in shared
> > + * memory. Can be converted to and from regular TupleDesc format. Doesn't
> > + * support constraints and doesn't store the actual type OID, because this is
> > + * only for use with RECORD types as created by CreateTupleDesc(). These are
> > + * arranged into a linked list, in the hash table entry corresponding to the
> > + * OIDs of the first 16 attributes, so we'd expect to get more than one entry
> > + * in the list when names and other properties differ.
> > + */
> > +typedef struct SerializedTupleDesc
> > +{
> > + dsa_pointer next; /* next with the same attribute OIDs */
> > + int natts; /* number of attributes in the tuple */
> > + int32 typmod; /* typmod for tuple type */
> > + bool hasoid; /* tuple has oid attribute in its header */
> > +
> > + /*
> > + * The attributes follow. We only ever access the first
> > + * ATTRIBUTE_FIXED_PART_SIZE bytes of each element, like the code in
> > + * tupdesc.c.
> > + */
> > + FormData_pg_attribute attributes[FLEXIBLE_ARRAY_MEMBER];
> > +} SerializedTupleDesc;
> >
> > Not a fan of a separate tupledesc representation, that's just going to
> > lead to divergence over time. I think we should rather change the normal
> > tupledesc representation to be compatible with this, and 'just' have a
> > wrapper struct for the parallel case (with next and such).
>
> OK. I killed this. Instead I flattened tupleDesc to make it usable
> directly in shared memory as long as there are no constraints. There
> is still a small wrapper SharedTupleDesc, but that's just to bolt a
> 'next' pointer to them so we can chain together TupleDescs with the
> same OIDs.

Yep, that makes sense.
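
So the wrapper ends up roughly like this, if I understand you (sketch
per your description; member layout hypothetical):

typedef struct SharedTupleDesc
{
	dsa_pointer next;			/* next desc with the same attribute OIDs */
	struct tupleDesc tupdesc;	/* flattened TupleDesc stored in place */
} SharedTupleDesc;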

> > +/*
> > + * An entry in SharedRecordTypmodRegistry's attribute index. The key is the
> > + * first REC_HASH_KEYS attribute OIDs. That means that collisions are
> > + * possible, but that's OK because SerializedTupleDesc objects are arranged
> > + * into a list.
> > + */
> >
> > +/* Parameters for SharedRecordTypmodRegistry's attributes hash table. */
> > +const static dht_parameters srtr_atts_index_params = {
> > + sizeof(Oid) * REC_HASH_KEYS,
> > + sizeof(SRTRAttsIndexEntry),
> > + memcmp,
> > + tag_hash,
> > + LWTRANCHE_SHARED_RECORD_ATTS_INDEX
> > +};
> > +
> > +/* Parameters for SharedRecordTypmodRegistry's typmod hash table. */
> > +const static dht_parameters srtr_typmod_index_params = {
> > + sizeof(uint32),
> > + sizeof(SRTRTypmodIndexEntry),
> > + memcmp,
> > + tag_hash,
> > + LWTRANCHE_SHARED_RECORD_TYPMOD_INDEX
> > +};
> > +
> >
> > I'm very much not a fan of this representation. I know you copied the
> > logic, but I think it's a bad idea. I think the key should just be a
> > dsa_pointer, and then we can have a proper tag_hash that hashes the
> > whole thing, and a proper comparator too. Just have
> >
> > /*
> > * Combine two hash values, resulting in another hash value, with decent bit
> > * mixing.
> > *
> > * Similar to boost's hash_combine().
> > */
> > static inline uint32
> > hash_combine(uint32 a, uint32 b)
> > {
> > a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
> > return a;
> > }
> >
> > and then hash everything.
>
> Hmm. I'm not sure I understand. I know what hash_combine is for but
> what do you mean when you say the key should just be a dsa_pointer?

> What's wrong with providing the key size, whole entry size, compare
> function and hash function like this?

Well, right now the key is "sizeof(Oid) * REC_HASH_KEYS", which imo is
fairly ugly. Both because it wastes space for narrow cases, and because
it leads to conflicts for wide ones. By having a dsa_pointer as the key
and custom hash/compare functions there's no need for that: you can just
compute the hash based on all keys, and compare based on all keys.
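
I.e. something like this (sketch only), using the hash_combine() above:

/* Hash all attribute type OIDs of a (flattened) tupledesc. */
static uint32
srtr_atts_hash(const FormData_pg_attribute *attrs, int natts)
{
	uint32		hash = 0;
	int			i;

	/* Cover every attribute, not just the first REC_HASH_KEYS of them. */
	for (i = 0; i < natts; ++i)
		hash = hash_combine(hash, (uint32) attrs[i].atttypid);
	return hash;
}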

> > +/*
> > + * Make sure that RecordCacheArray is large enough to store 'typmod'.
> > + */
> > +static void
> > +ensure_record_cache_typmod_slot_exists(int32 typmod)
> > +{
> > + if (RecordCacheArray == NULL)
> > + {
> > + RecordCacheArray = (TupleDesc *)
> > + MemoryContextAllocZero(CacheMemoryContext, 64 * sizeof(TupleDesc));
> > + RecordCacheArrayLen = 64;
> > + }
> > +
> > + if (typmod >= RecordCacheArrayLen)
> > + {
> > + int32 newlen = RecordCacheArrayLen * 2;
> > +
> > + while (typmod >= newlen)
> > + newlen *= 2;
> > +
> > + RecordCacheArray = (TupleDesc *) repalloc(RecordCacheArray,
> > + newlen * sizeof(TupleDesc));
> > + memset(RecordCacheArray + RecordCacheArrayLen, 0,
> > + (newlen - RecordCacheArrayLen) * sizeof(TupleDesc *));
> > + RecordCacheArrayLen = newlen;
> > + }
> > +}
> >
> > Do we really want to keep this? Could just have an equivalent dynahash
> > for the non-parallel case?
>
> Hmm. Well the plain old array makes a lot of sense in the
> non-parallel case, since we allocate typmods starting from zero. What
> don't you like about it? The reason for using an array for
> backend-local lookup (aside from "that's how it is already") is that
> it's actually the best data structure for the job; the reason for
> using a hash table in the shared case is that it gives you locking and
> coordinates growth for free. (For the OID index it has to be a hash
> table in both cases.)

Well, that reason kinda vanished after the parallelism introduction, no?
There's no guarantee at all anymore that it's gapless - it's perfectly
possible that each worker ends up with a distinct set of ids.
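
FWIW, the dynahash equivalent would be trivial (sketch; the entry
struct and hash table variable are hypothetical):

	HASHCTL		ctl;

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(int32);	/* the typmod */
	ctl.entrysize = sizeof(RecordCacheEntry);
	ctl.hcxt = CacheMemoryContext;
	RecordCacheHash = hash_create("record typmod cache", 64, &ctl,
								  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);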

Greetings,

Andres Freund
