Quick Links

Re: POC: Sharing record typmods between backends

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: POC: Sharing record typmods between backends
Date:	2017-08-15 05:44:55
Message-ID:	CAEepm=3eNfnF_7GMMLs12=O_YvO0OtDn85XwZkE1__YR_7CWPg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Aug 14, 2017 at 12:32 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Review for 0001:
>
> I think you made a few long lines even longer, like:
>
> @@ -1106,11 +1106,11 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, pltcl_call_state *call_state,
> Tcl_ListObjAppendElement(NULL, tcl_trigtup, Tcl_NewObj());
> for (i = 0; i < tupdesc->natts; i++)
> {
> - if (tupdesc->attrs[i]->attisdropped)
> + if (TupleDescAttr(tupdesc, i)->attisdropped)
> Tcl_ListObjAppendElement(NULL, tcl_trigtup, Tcl_NewObj());
> else
> Tcl_ListObjAppendElement(NULL, tcl_trigtup,
> - Tcl_NewStringObj(utf_e2u(NameStr(tupdesc->attrs[i]->attname)), -1));
> + Tcl_NewStringObj(utf_e2u(NameStr(TupleDescAttr(tupdesc, i)->attname)), -1));
>
>
> as it's not particularly pretty to access tupdesc->attrs[i] repeatedly,
> it'd be good if you instead had a local variable for the individual
> attribute.

Done.

> Similar:
> if (OidIsValid(get_base_element_type(TupleDescAttr(tupdesc, i)->atttypid)))
> sv = plperl_ref_from_pg_array(attr, TupleDescAttr(tupdesc, i)->atttypid);
> else if ((funcid = get_transform_fromsql(TupleDescAttr(tupdesc, i)->atttypid, current_call_data->prodesc->lang_oid, current_call_data->prodesc->trftypes)))
> sv = (SV *) DatumGetPointer(OidFunctionCall1(funcid, attr));

Done.

> @@ -150,7 +148,7 @@ ValuesNext(ValuesScanState *node)
> */
> values[resind] = MakeExpandedObjectReadOnly(values[resind],
> isnull[resind],
> - att[resind]->attlen);
> + TupleDescAttr(slot->tts_tupleDescriptor, resind)->attlen);
>
> @@ -158,9 +158,9 @@ convert_tuples_by_position(TupleDesc indesc,
> * must agree.
> */
> if (attrMap[i] == 0 &&
> - indesc->attrs[i]->attisdropped &&
> - indesc->attrs[i]->attlen == outdesc->attrs[i]->attlen &&
> - indesc->attrs[i]->attalign == outdesc->attrs[i]->attalign)
> + TupleDescAttr(indesc, i)->attisdropped &&
> + TupleDescAttr(indesc, i)->attlen == TupleDescAttr(outdesc, i)->attlen &&
> + TupleDescAttr(indesc, i)->attalign == TupleDescAttr(outdesc, i)->attalign)
> continue;

Done.

> I think you get the drift, there's more..../

Done in some more places too.

> Review for 0002:
>
> @@ -71,17 +71,17 @@ typedef struct tupleConstr
> typedef struct tupleDesc
> {
> int natts; /* number of attributes in the tuple */
> - Form_pg_attribute *attrs;
> - /* attrs[N] is a pointer to the description of Attribute Number N+1 */
> TupleConstr *constr; /* constraints, or NULL if none */
> Oid tdtypeid; /* composite type ID for tuple type */
> int32 tdtypmod; /* typmod for tuple type */
> bool tdhasoid; /* tuple has oid attribute in its header */
> int tdrefcount; /* reference count, or -1 if not counting */
> + /* attrs[N] is the description of Attribute Number N+1 */
> + FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
> } *TupleDesc;
>
> sorry if I'm beating on my hobby horse, but if we're re-ordering anyway,
> can you move TupleConstr to the second-to-last? That a) seems more
> consistent but b) (hobby horse, sorry) avoids unnecessary alignment
> padding.

Done.

> @@ -734,13 +708,13 @@ BuildDescForRelation(List *schema)
> /* Override TupleDescInitEntry's settings as requested */
> TupleDescInitEntryCollation(desc, attnum, attcollation);
> if (entry->storage)
> - desc->attrs[attnum - 1]->attstorage = entry->storage;
> + desc->attrs[attnum - 1].attstorage = entry->storage;
>
> /* Fill in additional stuff not handled by TupleDescInitEntry */
> - desc->attrs[attnum - 1]->attnotnull = entry->is_not_null;
> + desc->attrs[attnum - 1].attnotnull = entry->is_not_null;
> has_not_null |= entry->is_not_null;
> - desc->attrs[attnum - 1]->attislocal = entry->is_local;
> - desc->attrs[attnum - 1]->attinhcount = entry->inhcount;
> + desc->attrs[attnum - 1].attislocal = entry->is_local;
> + desc->attrs[attnum - 1].attinhcount = entry->inhcount;
>
> This'd be a lot more readable if wed just stored desc->attrs[attnum - 1]
> in a local variable. Also think it'd be good if it just used
> TupleDescAttr() for that. Probably as part of previous commit.

Done.

> @@ -366,8 +340,8 @@ equalTupleDescs(TupleDesc tupdesc1, TupleDesc tupdesc2)
>
> for (i = 0; i < tupdesc1->natts; i++)
> {
> - Form_pg_attribute attr1 = tupdesc1->attrs[i];
> - Form_pg_attribute attr2 = tupdesc2->attrs[i];
> + Form_pg_attribute attr1 = &tupdesc1->attrs[i];
> + Form_pg_attribute attr2 = &tupdesc2->attrs[i];
>
> I'd convert all these as part of the previous commit.

Done.

> @@ -99,12 +72,9 @@ CreateTemplateTupleDesc(int natts, bool hasoid)
>
> /*
> * CreateTupleDesc
> - * This function allocates a new TupleDesc pointing to a given
> + * This function allocates a new TupleDesc by copying a given
> * Form_pg_attribute array.
> *
> - * Note: if the TupleDesc is ever freed, the Form_pg_attribute array
> - * will not be freed thereby.
> - *
>
> I'm leaning towards no, but you could argue that we should just change
> that remark to be about constr?

I don't see why.

> @@ -51,39 +51,12 @@ CreateTemplateTupleDesc(int natts, bool hasoid)
>
> /*
> * Allocate enough memory for the tuple descriptor, including the
> - * attribute rows, and set up the attribute row pointers.
> - *
> - * Note: we assume that sizeof(struct tupleDesc) is a multiple of the
> - * struct pointer alignment requirement, and hence we don't need to insert
> - * alignment padding between the struct and the array of attribute row
> - * pointers.
> - *
> - * Note: Only the fixed part of pg_attribute rows is included in tuple
> - * descriptors, so we only need ATTRIBUTE_FIXED_PART_SIZE space per attr.
> - * That might need alignment padding, however.
> + * attribute rows.
> */
> - attroffset = sizeof(struct tupleDesc) + natts * sizeof(Form_pg_attribute);
> - attroffset = MAXALIGN(attroffset);
> - stg = palloc(attroffset + natts * MAXALIGN(ATTRIBUTE_FIXED_PART_SIZE));
> + attroffset = offsetof(struct tupleDesc, attrs);
> + stg = palloc0(attroffset + natts * sizeof(FormData_pg_attribute));
> desc = (TupleDesc) stg;
>
> note that attroffset isn't used anymore after this...

Tidied.

> We have two mildly different places allocating a tupledesc struct for a
> number of elements. Seems sense to put them into an AllocTupleDesc()?

Yeah. CreateTemplateTupleDesc() is already suitable. I changed
CreateTupleDesc() to call that instead of duplicating the allocation
code.

> Review of 0003:
>
> I'm not doing a too detailed review, given I think there's some changes
> in the pipeline.

Yep. In the new patch set the hash table formerly known as DHT is now
in patch 0004 and I made the following changes based on your feedback:

1. Renamed it to "dshash". The files are named dshash.{c,h}, and the
prefix on identifiers is dshash_. You suggested dsmhash, but the "m"
didn't seem to make much sense. I considered dsahash, but dshash
seemed better. Thoughts?

2. Ripped out the incremental resizing and iterator support for now,
as discussed. I want to post patches to add those features when we
have a use case but I can see that it's no slam dunk so I want to keep
that stuff out of the dependency graph for parallel hash.

3. Added support for hash and compare functions with an extra
argument for user data, a bit like qsort_arg_comparator. This is
necessary for functions that need to be able to dereference a
dsa_pointer stored in the entry, since they need the dsa_area. (I
would normally call such an argument 'user_data' or 'context' or
something but 'arg' seemed to be establish by qsort_arg.)

> @@ -0,0 +1,124 @@
> +/*-------------------------------------------------------------------------
> + *
> + * ds_hash_table.h
> + * Concurrent hash tables backed by dynamic shared memory areas.
>
> ...
>
> +/*
> + * The opaque type representing a hash table. While the struct is defined
> + * before, client code should consider it to be be an opaque and deal only in
> + * pointers to it.
> + */
> +struct dht_hash_table;
> +typedef struct dht_hash_table dht_hash_table;
>
> "defined before"?

Bleugh. Fixed.

> +/*
> + * The opaque type used for iterator state. While the struct is actually
> + * defined below so it can be used on the stack, client code should deal only
> + * in pointers to it rather than accessing its members.
> + */
> +struct dht_iterator;
> +typedef struct dht_iterator dht_iterator;
>
> s/used/allocated/?

Removed for now, see above.

> +
> +/*
> + * The set of parameters needed to create or attach to a hash table. The
> + * members tranche_id and tranche_name do not need to be initialized when
> + * attaching to an existing hash table. The functions do need to be supplied
> + * even when attaching because we can't safely share function pointers between
> + * backends in general.
> + */
> +typedef struct
> +{
> + size_t key_size; /* Size of the key (initial bytes of entry) */
> + size_t entry_size; /* Total size of entry */
> + dht_compare_function compare_function; /* Compare function */
> + dht_hash_function hash_function; /* Hash function */
> + int tranche_id; /* The tranche ID to use for locks. */
> +} dht_parameters;
>
> Wonder if it'd make sense to say that key/entry sizes to be only
> minimums? That means we could increase them to be the proper aligned
> size?

I don't understand. You mean explicitly saying that there are
overheads? Doesn't that go without saying?

> /*
> * Unlock an entry which was locked by dht_find or dht_find_or_insert.
> */
> void
> dht_release(dht_hash_table *hash_table, void *entry)
>
> /*
> * Release the most recently obtained lock. This can optionally be called in
> * between calls to dht_iterate_next to allow other processes to access the
> * same partition of the hash table.
> */
> void
> dht_iterate_release_lock(dht_iterator *iterator)
>
> I'd add lock to the first too.

Done.

> FWIW, I'd be perfectly fine with abbreviating iterate/iterator to "iter"
> or something. We already have that elsewhere and it's pretty clear.

Will return to this question in a future submission that adds iterators.

> +/*
> + * Print out debugging information about the internal state of the hash table.
> + * No locks must be held by the caller.
> + */
> +void
>
> Should specify where the information is printed.

Done.

> + * scan. (We don't actually expect them to have more than 1 item unless
> + * the hash function is of low quality.)
>
> That, uh, seems like an hasty remark. Even with a good hash function
> you're going to get collisions occasionally.
>
>
> Review of 0004:
>
> Ignoring aspects related to REC_HASH_KEYS and related discussion, since
> we're already discussing that in another email.

This version includes new refactoring patches 0003, 0004 to get rid of
REC_HASH_KEYS by teaching the hash table how to use a TupleDesc as a
key directly. Then the shared version does approximately the same
thing, with a couple of extra hoops to jump thought. Thoughts?

> +static int32
> +find_or_allocate_shared_record_typmod(TupleDesc tupdesc)
> +{
>
> + /*
> + * While we still hold the atts_index entry locked, add this to
> + * typmod_index. That's important because we don't want anyone to be able
> + * to find a typmod via the former that can't yet be looked up in the
> + * latter.
> + */
> + PG_TRY();
> + {
> + typmod_index_entry =
> + dht_find_or_insert(CurrentSharedRecordTypmodRegistry.typmod_index,
> + &typmod, &found);
> + if (found)
> + elog(ERROR, "cannot create duplicate shared record typmod");
> + }
> + PG_CATCH();
> + {
> + /*
> + * If we failed to allocate or elog()ed, we have to be careful not to
> + * leak the shared memory. Note that we might have created a new
> + * atts_index entry above, but we haven't put anything in it yet.
> + */
> + dsa_free(CurrentSharedRecordTypmodRegistry.area, shared_dp);
> + PG_RE_THROW();
> + }
>
> Not entirely related, but I do wonder if we don't need abetter solution
> to this. Something like dsa pointers that register appropriate memory
> context callbacks to get deleted in case of errors?

Huh, scope guards. I have had some ideas about some kind of
destructor mechanism that might replace what we're doing with DSM
detach hooks in various places and also work in containers like hash
tables (ie entries could have destructors), but doing it with the
stack is another level...

> Codewise this seems ok aside from the already discussed issues.
>
> But architecturally I'm still not sure I quite like the a bit ad-hoc
> manner session state is defined here. I think we much more should go
> towards a PGPROC like PGSESSION array, that PGPROCs reference. That'd
> also be preallocated in "normal" shmem. From there things like the
> handle for a dht typmod table could be referenced. I think we should
> slowly go towards a world where session state isn't in a lot of file
> local static variables. I don't know if this is the right moment to
> start doing so, but I think it's quite soon.

No argument from me about that general idea. All our global state is
an obstacle for testability, multi-threading, new CPU scheduling
architectures etc. I had been trying to avoid getting too adventurous
here, but here goes nothing... In this version there is an honest
Session struct. There is still a single global variable --
CurrentSession -- which would I guess could be a candidate to become a
thread-local variable from the future (or alternatively an argument to
every function that needs session access). Is this better? Haven't
tested this much yet but seems like better code layout to me.

> Reviewing 0005:
>
> Yay!

That's the kind of review I like! Thanks.

I will also post some testing code soon.

--
Thomas Munro
http://www.enterprisedb.com

Attachment	Content-Type	Size
shared-record-typmods-v5.patchset.tgz	application/x-gzip	72.9 KB

In response to

Re: POC: Sharing record typmods between backends at 2017-08-14 00:32:21 from Andres Freund

Responses

Re: POC: Sharing record typmods between backends at 2017-08-15 21:17:04 from Thomas Munro
Re: POC: Sharing record typmods between backends at 2017-08-15 22:06:53 from Andres Freund
Re: POC: Sharing record typmods between backends at 2017-08-20 18:17:23 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Chris Travers	2017-08-15 06:24:37	Re: Orphaned files in base/[oid]
Previous Message	Peter Geoghegan	2017-08-15 05:43:29	Re: INSERT .. ON CONFLICT DO SELECT [FOR ..]