Re: POC: Sharing record typmods between backends

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Sharing record typmods between backends
Date: 2017-08-11 08:39:13
Message-ID: CAEepm=0R4t-_Jrk9hWaOj9JLdQc2kzdkVZXXCkqd=XFGs41hzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Please find attached a new patch series. I apologise in advance for
0001 and note that the patchset now weighs in at ~75kB compressed.
Here are my in-line replies to your two reviews:

On Tue, Jul 25, 2017 at 10:09 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> It does concern me that we're growing yet another somewhat different
> hashtable implementation. Yet I don't quite see how we could avoid
> it. dynahash relies on proper pointers, simplehash doesn't do locking
> (and shouldn't) and also relies on pointers, although to a much lesser
> degree. All the open coded tables aren't a good match either. So I
> don't quite see an alternative, but I'd love one.

Yeah, I agree. To deal with data structures with different pointer
types, locking policy, inlined hash/eq functions etc, perhaps there is
a way we could eventually do 'policy based design' using the kind of
macro trickery you started where we generate N different hash table
variations but only have to maintain source code for one chaining hash
table implementation? Or perl scripts that effectively behave as a
cfront^H^H^H nevermind. I'm not planning to investigate that for this
cycle.

>
> diff --git a/src/backend/lib/dht.c b/src/backend/lib/dht.c
> new file mode 100644
> index 00000000000..2fec70f7742
> --- /dev/null
> +++ b/src/backend/lib/dht.c
>
> FWIW, not a big fan of dht as a filename (nor of dsa.c). For one DHT
> usually refers to distributed hash tables, which this is not, and for
> another the abbreviation is so short it's not immediately
> understandable, and likely to conflict further. I think it'd possibly
> ok to have dht as symbol prefixes, but rename the file to be longer.

OK. Now it's ds_hash_table.{c,h}, where "ds" stands for "dynamic
shared". Better? If we were to do other data structures in DSA
memory they could follow that style: ds_red_black_tree.c, ds_vector.c,
ds_deque.c etc and their identifier prefix would be drbt_, dv_, dd_
etc.

Do you want to see a separate patch to rename dsa.c? Got a better
name? You could have spoken up earlier :-) It does sound like a bit
like the thing from crypto or perhaps a scary secret government
department.

> + * To deal with currency, it has a fixed size set of partitions, each of which
> + * is independently locked.
>
> s/currency/concurrency/ I presume.

Fixed.

> + * Each bucket maps to a partition; so insert, find
> + * and iterate operations normally only acquire one lock. Therefore, good
> + * concurrency is achieved whenever they don't collide at the lock partition
>
> s/they/operations/?

Fixed.

> + * level. However, when a resize operation begins, all partition locks must
> + * be acquired simultaneously for a brief period. This is only expected to
> + * happen a small number of times until a stable size is found, since growth is
> + * geometric.
>
> I'm a bit doubtful that we need partitioning at this point, and that it
> doesn't actually *degrade* performance for your typmod case.

Yeah, partitioning not needed for this case, but this is supposed to
be more generally useful. I thought about making the number of
partitions a construction parameter, but it doesn't really hurt does
it?

> + * Resizing is done incrementally so that no individual insert operation pays
> + * for the potentially large cost of splitting all buckets.
>
> I'm not sure this is a reasonable tradeoff for the use-case suggested so
> far, it doesn't exactly make things simpler. We're not going to grow
> much.

Yeah, designed to be more generally useful. Are you saying you would
prefer to see the DHT patch split into an initial submission that does
the simplest thing possible, so that the unlucky guy who causes the
hash table to grow has to do all the work of moving buckets to a
bigger hash table? Then we could move the more complicated
incremental growth stuff to a later patch.

> +/* The opaque type used for tracking iterator state. */
> +struct dht_iterator;
> +typedef struct dht_iterator dht_iterator;
>
> Isn't it actually the iterator state? Rather than tracking it? Also, why
> is it opaque given you're actually defining it below? Guess you'd moved
> it at some point.

Improved comment. The iterator state is defined below in the .h, but
with a warning that client code mustn't access it; it exists in the
header only because it's very useful to be able to but dht_iterator on
the stack which requires the client code to have its definition, but I
want to reserve the right to change it arbitrarily in future.

> +/*
> + * The set of parameters needed to create or attach to a hash table. The
> + * members tranche_id and tranche_name do not need to be initialized when
> + * attaching to an existing hash table.
> + */
> +typedef struct
> +{
> + Size key_size; /* Size of the key (initial bytes of entry) */
> + Size entry_size; /* Total size of entry */
>
> Let's use size_t, like we kind of concluded in the thread you started:
> http://archives.postgresql.org/message-id/25076.1489699457%40sss.pgh.pa.us
> :)

Sold.

> + dht_compare_function compare_function; /* Compare function */
> + dht_hash_function hash_function; /* Hash function */
>
> Might be worth explaining that these need to provided when attaching
> because they're possibly process local. Did you test this with
> EXEC_BACKEND?

Added explanation. I haven't personally tested with EXEC_BACKEND but
I believe one of my colleagues had something else that uses DHT this
running on a Windows box and didn't shout at me, and I see no reason
to think it shouldn't work: as explained in the new comment, every
attacher has to supply the function pointers from their own process
space (and standard footgun rules apply if you don't supply compatible
functions).

> + int tranche_id; /* The tranche ID to use for locks. */
> +} dht_parameters;
>
>
> +struct dht_iterator
> +{
> + dht_hash_table *hash_table; /* The hash table we are iterating over. */
> + bool exclusive; /* Whether to lock buckets exclusively. */
> + Size partition; /* The index of the next partition to visit. */
> + Size bucket; /* The index of the next bucket to visit. */
> + dht_hash_table_item *item; /* The most recently returned item. */
> + dsa_pointer last_item_pointer; /* The last item visited. */
> + Size table_size_log2; /* The table size when we started iterating. */
> + bool locked; /* Whether the current partition is locked. */
>
> Haven't gotten to the actual code yet, but this kinda suggest we leave a
> partition locked when iterating? Hm, that seems likely to result in a
> fair bit of pain...

By default yes, but you can release the lock with
dht_iterate_release_lock() and it'll be reacquired when you call
dht_iterate_next(). If you do that, then you'll continue iterating
after where you left off without visiting any item that you've already
visited, because the pointers are stored in pointer order (even though
the most recently visited item may have been freed rendering the
pointer invalid, we can still use its pointer to skip everything
already visited by numerical comparison without dereferencing it, and
it's indeterminate whether anything added while you were unlocked is
visible to you).

Maintaining linked lists in a certain order sucks, but DHT doesn't
allow duplicate keys and grows when load factor exceeds X so unless
your hash function is busted...

This is complicated, and in the category that I would normally want a
stack of heavy unit tests for. If you don't feel like making
decisions about this now, perhaps iteration (and incremental resize?)
could be removed, leaving only the most primitive get/put hash table
facilities -- just enough for this purpose? Then a later patch could
add them back, with a set of really convincing unit tests...

> +/* Iterating over the whole hash table. */
> +extern void dht_iterate_begin(dht_hash_table *hash_table,
> + dht_iterator *iterator, bool exclusive);
> +extern void *dht_iterate_next(dht_iterator *iterator);
> +extern void dht_iterate_delete(dht_iterator *iterator);
>
> s/delete/delete_current/? Otherwise it looks like it's part of
> manipulating just the iterator.

Done.

> +extern void dht_iterate_release(dht_iterator *iterator);
>
> I'd add lock to to the name.

Done.

> +/*
> + * An item in the hash table. This wraps the user's entry object in an
> + * envelop that holds a pointer back to the bucket and a pointer to the next
> + * item in the bucket.
> + */
> +struct dht_hash_table_item
> +{
> + /* The hashed key, to avoid having to recompute it. */
> + dht_hash hash;
> + /* The next item in the same bucket. */
> + dsa_pointer next;
> + /* The user's entry object follows here. */
> + char entry[FLEXIBLE_ARRAY_MEMBER];
>
> What's the point of using FLEXIBLE_ARRAY_MEMBER here? And isn't using a
> char going to lead to alignment problems?

Fixed. No longer using a member 'entry', just a comment that user
data follows and a macro to find it based on
MAXALIGN(sizeof(dht_hash_table_item)).

> +/* The number of partitions for locking purposes. */
> +#define DHT_NUM_PARTITIONS_LOG2 7
>
> Could use some justification.

Added. Short version: if it's good enough for the buffer pool...

> +/*
> + * The head object for a hash table. This will be stored in dynamic shared
> + * memory.
> + */
> +typedef struct
> +{
>
> Why anonymous? Not that it hurts much, but seems weird to deviate just
> here.

Fixed.

> +/*
> + * Create a new hash table backed by the given dynamic shared area, with the
> + * given parameters.
> + */
> +dht_hash_table *
> +dht_create(dsa_area *area, const dht_parameters *params)
> +{
> + dht_hash_table *hash_table;
> + dsa_pointer control;
> +
> + /* Allocate the backend-local object representing the hash table. */
> + hash_table = palloc(sizeof(dht_hash_table));
>
> Should be documented that this uses caller's MemoryContext.

Done.

> + /* Set up the array of lock partitions. */
> + {
> + int i;
> +
> + for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
> + {
> + LWLockInitialize(PARTITION_LOCK(hash_table, i),
> + hash_table->control->lwlock_tranche_id);
> + hash_table->control->partitions[i].count = 0;
> + }
>
> I'd store hash_table->control->lwlock_tranche_id and partitions[i] in
> local vars. Possibly hash_table->control too.

Tidied up. I made local vars for partitions and tranche_id.

> +/*
> + * Detach from a hash table. This frees backend-local resources associated
> + * with the hash table, but the hash table will continue to exist until it is
> + * either explicitly destroyed (by a backend that is still attached to it), or
> + * the area that backs it is returned to the operating system.
> + */
> +void
> +dht_detach(dht_hash_table *hash_table)
> +{
> + /* The hash table may have been destroyed. Just free local memory. */
> + pfree(hash_table);
> +}
>
> Somewhat inclined to add debugging refcount - seems like bugs around
> that might be annoying to find. Maybe also add an assert ensuring that
> no locks are held?

Added assert that not locks are held.

In an earlier version I had reference counts. Then I realised that it
wasn't really helping anything. The state of being 'attached' to a
dht_hash_table isn't really the same as holding a heavyweight resource
like a DSM segment or a file which is backed by kernel resources.
'Attaching' is just something you have to do to get a backend-local
palloc()-ated object required to interact with the hash table, and
since it's just a bit of memory there is no strict requirement to
detach from it, if you're happy to let MemoryContext do that for you.
To put it in GC terms, there is no important finalizer here. Here I
am making the same distinction that we make between stuff managed by
resowner.c (files etc) and stuff managed by MemoryContext (memory); in
the former case it's an elog()-gable offence not to close things
explicitly in non-error paths, but in the latter you're free to do
that, or pfree earlier. If in future we create more things that can
live in DSA memory, I'd like them to be similarly free-and-easy. Make
sense?

In any case this use user of DHT remains attached for the backend's lifetime.

> +/*
> + * Look up an entry, given a key. Returns a pointer to an entry if one can be
> + * found with the given key. Returns NULL if the key is not found. If a
> + * non-NULL value is returned, the entry is locked and must be released by
> + * calling dht_release. If an error is raised before dht_release is called,
> + * the lock will be released automatically, but the caller must take care to
> + * ensure that the entry is not left corrupted. The lock mode is either
> + * shared or exclusive depending on 'exclusive'.
>
> This API seems a bit fragile.

Do you mean "... the caller must take care to ensure that the entry is
not left corrupted"? This is the same as anything protected by
LWLocks including shared buffers. If you error out, locks are
released and you had better not have left things in a bad state. I
guess this comment is really just about what C++ people call "basic
exception safety".

Or something else?

> +/*
> + * Returns a pointer to an exclusively locked item which must be released with
> + * dht_release. If the key is found in the hash table, 'found' is set to true
> + * and a pointer to the existing entry is returned. If the key is not found,
> + * 'found' is set to false, and a pointer to a newly created entry is related.
>
> "is related"?

Fixed.

> + */
> +void *
> +dht_find_or_insert(dht_hash_table *hash_table,
> + const void *key,
> + bool *found)
> +{
> + size_t hash;
> + size_t partition_index;
> + dht_partition *partition;
> + dht_hash_table_item *item;
> +
> + hash = hash_table->params.hash_function(key, hash_table->params.key_size);
> + partition_index = PARTITION_FOR_HASH(hash);
> + partition = &hash_table->control->partitions[partition_index];
> +
> + Assert(hash_table->control->magic == DHT_MAGIC);
> + Assert(!hash_table->exclusively_locked);
>
> Why just exclusively locked? Why'd shared be ok?

It wouldn't be OK. I just didn't have the state required to assert
that. Fixed.

I think in future it should be allowed to lock more than one partition
(conceptually more than one entry) at a time, but only after figuring
out a decent API to support doing that in deadlock-avoiding order. I
don't have a need or a plan for that yet. For the same reason it's
not OK to use dht_find[_or_insert] while any iterator has locked a
partition, which wasn't documented (is now) and isn't currently
assertable.

> +/*
> + * Unlock an entry which was locked by dht_find or dht_find_or_insert.
> + */
> +void
> +dht_release(dht_hash_table *hash_table, void *entry)
> +{
> + dht_hash_table_item *item = ITEM_FROM_ENTRY(entry);
> + size_t partition_index = PARTITION_FOR_HASH(item->hash);
> + bool deferred_resize_work = false;
> +
> + Assert(hash_table->control->magic == DHT_MAGIC);
>
> Assert lock held (LWLockHeldByMe())

Added this, and a couple more.

> +/*
> + * Begin iterating through the whole hash table. The caller must supply a
> + * dht_iterator object, which can then be used to call dht_iterate_next to get
> + * values until the end is reached.
> + */
> +void
> +dht_iterate_begin(dht_hash_table *hash_table,
> + dht_iterator *iterator,
> + bool exclusive)
> +{
> + Assert(hash_table->control->magic == DHT_MAGIC);
> +
> + iterator->hash_table = hash_table;
> + iterator->exclusive = exclusive;
> + iterator->partition = 0;
> + iterator->bucket = 0;
> + iterator->item = NULL;
> + iterator->last_item_pointer = InvalidDsaPointer;
> + iterator->locked = false;
> +
> + /* Snapshot the size (arbitrary lock to prevent size changing). */
> + LWLockAcquire(PARTITION_LOCK(hash_table, 0), LW_SHARED);
> + ensure_valid_bucket_pointers(hash_table);
> + iterator->table_size_log2 = hash_table->size_log2;
> + LWLockRelease(PARTITION_LOCK(hash_table, 0));
>
> Hm. So we're introducing some additional contention on partition 0 -
> probably ok.

It would be cute to use MyProcPid % DHT_NUM_PARTITIONS, but that might
be a deadlock hazard if you have multiple iterators on the go at once.
Otherwise iterators only ever lock partitions in order.

> +/*
> + * Move to the next item in the hash table. Returns a pointer to an entry, or
> + * NULL if the end of the hash table has been reached. The item is locked in
> + * exclusive or shared mode depending on the argument given to
> + * dht_iterate_begin. The caller can optionally release the lock by calling
> + * dht_iterate_release, and then call dht_iterate_next again to move to the
> + * next entry. If the iteration is in exclusive mode, client code can also
> + * call dht_iterate_delete. When the end of the hash table is reached, or at
> + * any time, the client may call dht_iterate_end to abandon iteration.
> + */
>
> I'd just shorten the end to "at any time the client may call
> dht_iterate_end to ..."

Done.

> [snip]
> +
> +/*
> + * Print out debugging information about the internal state of the hash table.
> + */
> +void
> +dht_dump(dht_hash_table *hash_table)
> +{
> + size_t i;
> + size_t j;
> +
> + Assert(hash_table->control->magic == DHT_MAGIC);
> +
> + for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
> + LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
>
> Should probably assert & document that no locks are held - otherwise
> there's going to be ugly deadlocks. And that's an unlikely thing to try.

OK.

> [snip]
> +}
>
> I'd put this below actual "production" code.

Done.

On Tue, Aug 1, 2017 at 9:08 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
> index 9fd7b4e019b..97c0125a4ba 100644
> --- a/src/backend/access/common/tupdesc.c
> +++ b/src/backend/access/common/tupdesc.c
> @@ -337,17 +337,75 @@ DecrTupleDescRefCount(TupleDesc tupdesc)
> {
> Assert(tupdesc->tdrefcount > 0);
>
> - ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
> + if (CurrentResourceOwner != NULL)
> + ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
> if (--tupdesc->tdrefcount == 0)
> FreeTupleDesc(tupdesc);
> }
>
> What's this about? CurrentResourceOwner should always be valid here, no?
> If so, why did that change? I don't think it's good to detach this from
> the resowner infrastructure...

The reason is that I install a detach hook
shared_record_typmod_registry_detach() in worker processes to clear
out their typmod registry. It runs at a time when there is no
CurrentResourceOwner. It's a theoretical concern only today, because
workers are not reused. If a workers lingered in a waiting room and
then attached to a new session DSM from a different leader, then it
needs to remember nothing of the previous leader's typmods.

> /*
> - * Compare two TupleDesc structures for logical equality
> + * Compare two TupleDescs' attributes for logical equality
> *
> * Note: we deliberately do not check the attrelid and tdtypmod fields.
> * This allows typcache.c to use this routine to see if a cached record type
> * matches a requested type, and is harmless for relcache.c's uses.
> + */
> +bool
> +equalTupleDescAttrs(Form_pg_attribute attr1, Form_pg_attribute attr2)
> +{
>
> comment not really accurate, this routine afaik isn't used by
> typcache.c?

I removed this whole hunk and left equalTupleDescs() alone, because I
no longer needed to make that change in this new version. See below.

> /*
> - * Magic numbers for parallel state sharing. Higher-level code should use
> - * smaller values, leaving these very large ones for use by this module.
> + * Magic numbers for per-context parallel state sharing. Higher-level code
> + * should use smaller values, leaving these very large ones for use by this
> + * module.
> */
> #define PARALLEL_KEY_FIXED UINT64CONST(0xFFFFFFFFFFFF0001)
> #define PARALLEL_KEY_ERROR_QUEUE UINT64CONST(0xFFFFFFFFFFFF0002)
> @@ -63,6 +74,16 @@
> #define PARALLEL_KEY_ACTIVE_SNAPSHOT UINT64CONST(0xFFFFFFFFFFFF0007)
> #define PARALLEL_KEY_TRANSACTION_STATE UINT64CONST(0xFFFFFFFFFFFF0008)
> #define PARALLEL_KEY_ENTRYPOINT UINT64CONST(0xFFFFFFFFFFFF0009)
> +#define PARALLEL_KEY_SESSION_DSM UINT64CONST(0xFFFFFFFFFFFF000A)
> +
> +/* Magic number for per-session DSM TOC. */
> +#define PARALLEL_SESSION_MAGIC 0xabb0fbc9
> +
> +/*
> + * Magic numbers for parallel state sharing in the per-session DSM area.
> + */
> +#define PARALLEL_KEY_SESSION_DSA UINT64CONST(0xFFFFFFFFFFFF0001)
> +#define PARALLEL_KEY_RECORD_TYPMOD_REGISTRY UINT64CONST(0xFFFFFFFFFFFF0002)
>
> Not this patch's fault, but this infrastructure really isn't great. We
> should really replace it with a shmem.h style infrastructure, using a
> dht hashtable as backing...

Well, I am trying to use the established programming style. We
already have a per-query DSM with a TOC indexed by magic numbers (and
executor node IDs). I add a per-session DSM with a TOC indexed by a
different set of magic numbers. We could always come up with
something better and fix it in both places later?

> +/* The current per-session DSM segment, if attached. */
> +static dsm_segment *current_session_segment = NULL;
> +
>
> I think it'd be better if we had a proper 'SessionState' and
> 'BackendSessionState' infrastructure that then contains the dsm segment
> etc. I think we'll otherwise just end up with a bunch of parallel
> infrastructures.

I'll have to come back to you on this one.

> +/*
> + * A mechanism for sharing record typmods between backends.
> + */
> +struct SharedRecordTypmodRegistry
> +{
> + dht_hash_table_handle atts_index_handle;
> + dht_hash_table_handle typmod_index_handle;
> + pg_atomic_uint32 next_typmod;
> +};
> +
>
> I think the code needs to explain better how these are intended to be
> used. IIUC, atts_index is used to find typmods by "identity", and
> typmod_index by the typmod, right? And we need both to avoid
> all workers generating different tupledescs, right? Kinda guessable by
> reading typecache.c, but that shouldn't be needed.

Fixed.

> +/*
> + * A flattened/serialized representation of a TupleDesc for use in shared
> + * memory. Can be converted to and from regular TupleDesc format. Doesn't
> + * support constraints and doesn't store the actual type OID, because this is
> + * only for use with RECORD types as created by CreateTupleDesc(). These are
> + * arranged into a linked list, in the hash table entry corresponding to the
> + * OIDs of the first 16 attributes, so we'd expect to get more than one entry
> + * in the list when named and other properties differ.
> + */
> +typedef struct SerializedTupleDesc
> +{
> + dsa_pointer next; /* next with the same same attribute OIDs */
> + int natts; /* number of attributes in the tuple */
> + int32 typmod; /* typmod for tuple type */
> + bool hasoid; /* tuple has oid attribute in its header */
> +
> + /*
> + * The attributes follow. We only ever access the first
> + * ATTRIBUTE_FIXED_PART_SIZE bytes of each element, like the code in
> + * tupdesc.c.
> + */
> + FormData_pg_attribute attributes[FLEXIBLE_ARRAY_MEMBER];
> +} SerializedTupleDesc;
>
> Not a fan of a separate tupledesc representation, that's just going to
> lead to divergence over time. I think we should rather change the normal
> tupledesc representation to be compatible with this, and 'just' have a
> wrapper struct for the parallel case (with next and such).

OK. I killed this. Instead I flattened tupleDesc to make it usable
directly in shared memory as long as there are no constraints. There
is still a small wrapper SharedTupleDesc, but that's just to bolt a
'next' pointer to them so we can chain together TupleDescs with the
same OIDs.

The new 0001 patch changes tupdesc->attrs[i]->foo with
TupleDescAttr(tupdesc, i)->foo everywhere in the tree, so that the
change from attrs[i] to &attrs[i] can be hidden.

> +/*
> + * An entry in SharedRecordTypmodRegistry's attribute index. The key is the
> + * first REC_HASH_KEYS attribute OIDs. That means that collisions are
> + * possible, but that's OK because SerializedTupleDesc objects are arranged
> + * into a list.
> + */
>
> +/* Parameters for SharedRecordTypmodRegistry's attributes hash table. */
> +const static dht_parameters srtr_atts_index_params = {
> + sizeof(Oid) * REC_HASH_KEYS,
> + sizeof(SRTRAttsIndexEntry),
> + memcmp,
> + tag_hash,
> + LWTRANCHE_SHARED_RECORD_ATTS_INDEX
> +};
> +
> +/* Parameters for SharedRecordTypmodRegistry's typmod hash table. */
> +const static dht_parameters srtr_typmod_index_params = {
> + sizeof(uint32),
> + sizeof(SRTRTypmodIndexEntry),
> + memcmp,
> + tag_hash,
> + LWTRANCHE_SHARED_RECORD_TYPMOD_INDEX
> +};
> +
>
> I'm very much not a fan of this representation. I know you copied the
> logic, but I think it's a bad idea. I think the key should just be a
> dsa_pointer, and then we can have a proper tag_hash that hashes the
> whole thing, and a proper comparator too. Just have
>
> /*
> * Combine two hash values, resulting in another hash value, with decent bit
> * mixing.
> *
> * Similar to boost's hash_combine().
> */
> static inline uint32
> hash_combine(uint32 a, uint32 b)
> {
> a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
> return a;
> }
>
> and then hash everything.

Hmm. I'm not sure I understand. I know what hash_combine is for but
what do you mean when you say they key should just be a dsa_pointer?
What's wrong with providing the key size, whole entry size, compare
function and hash function like this?

> +/*
> + * Make sure that RecordCacheArray is large enough to store 'typmod'.
> + */
> +static void
> +ensure_record_cache_typmod_slot_exists(int32 typmod)
> +{
> + if (RecordCacheArray == NULL)
> + {
> + RecordCacheArray = (TupleDesc *)
> + MemoryContextAllocZero(CacheMemoryContext, 64 * sizeof(TupleDesc));
> + RecordCacheArrayLen = 64;
> + }
> +
> + if (typmod >= RecordCacheArrayLen)
> + {
> + int32 newlen = RecordCacheArrayLen * 2;
> +
> + while (typmod >= newlen)
> + newlen *= 2;
> +
> + RecordCacheArray = (TupleDesc *) repalloc(RecordCacheArray,
> + newlen * sizeof(TupleDesc));
> + memset(RecordCacheArray + RecordCacheArrayLen, 0,
> + (newlen - RecordCacheArrayLen) * sizeof(TupleDesc *));
> + RecordCacheArrayLen = newlen;
> + }
> +}
>
> Do we really want to keep this? Could just have an equivalent dynahash
> for the non-parallel case?

Hmm. Well the plain old array makes a lot of sense in the
non-parallel case, since we allocate typmods starting from zero. What
don't you like about it? The reason for using an array for
backend-local lookup (aside from "that's how it is already") is that
it's actually the best data structure for the job; the reason for
using a hash table in the shared case is that it gives you locking and
coordinates growth for free. (For the OID index it has to be a hash
table in both cases.)

> /*
> * lookup_rowtype_tupdesc_internal --- internal routine to lookup a rowtype
> @@ -1229,15 +1347,49 @@ lookup_rowtype_tupdesc_internal(Oid type_id, int32 typmod, bool noError)
> /*
> * It's a transient record type, so look in our record-type table.
> */
> - if (typmod < 0 || typmod >= NextRecordTypmod)
> + if (typmod >= 0)
> {
> - if (!noError)
> - ereport(ERROR,
> - (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> - errmsg("record type has not been registered")));
> - return NULL;
> + /* It is already in our local cache? */
> + if (typmod < RecordCacheArrayLen &&
> + RecordCacheArray[typmod] != NULL)
> + return RecordCacheArray[typmod];
> +
> + /* Are we attached to a SharedRecordTypmodRegistry? */
> + if (CurrentSharedRecordTypmodRegistry.shared != NULL)
>
> Why do we want to do lookups in both? I don't think it's a good idea to
> have a chance that you could have the same typmod in both the local
> registry (because it'd been created before the shared one) and in the
> shared (because it was created in a worker). Ah, that's for caching
> purposes? If so, see my above point that we shouldn't have a serialized
> version of typdesc (yesyes, constraints will be a bit ugly).

Right, that's what I've now done. It's basically a write-through
cache: we'll try to find it in the backend local structures and then
fall back to the shared one. But if we find it in shared memory,
we'll just copy the pointer it into our local data structures.

In the last version I'd build a new TupleDesc from the serialized
form, but now there is no serialized form, just TupleDesc objects
which are now shmem-friendly (except for constraints, which do not
survive the matter transfer into shmem; see TupleDescCopy).

> + /*
> + * While we still hold the atts_index entry locked, add this to
> + * typmod_index. That's important because we don't want anyone to be able
> + * to find a typmod via the former that can't yet be looked up in the
> + * latter.
> + */
> + typmod_index_entry =
> + dht_find_or_insert(CurrentSharedRecordTypmodRegistry.typmod_index,
> + &typmod, &found);
> + if (found)
> + elog(ERROR, "cannot create duplicate shared record typmod");
> + typmod_index_entry->typmod = typmod;
> + typmod_index_entry->serialized_tupdesc = serialized_dp;
> + dht_release(CurrentSharedRecordTypmodRegistry.typmod_index,
> + typmod_index_entry);
>
> What if we fail to allocate memory for the entry in typmod_index?

Well, I was careful to make sure that it was only pushed onto the list
in the atts_index entry until after we'd successfully allocated
entries in both indexes, so there was no way to exit from this
function leaving a TupleDesc in one index but not the other. In other
words, it might have created an atts_index entry but it'd have an
empty list. But yeah, on reflection we shouldn't leak shared_dp in
that case. In this version I added PG_CATCH() to dsa_free() it and
PG_RETHROW().

More testing and review needed.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
shared-record-typmods-v4.patchset.tgz application/x-gzip 74.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-08-11 08:52:59 Re: Foreign tables privileges not shown in information_schema.table_privileges
Previous Message Álvaro Hernández Tortosa 2017-08-11 06:50:16 Re: SCRAM protocol documentation