Re: POC: Sharing record typmods between backends

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Sharing record typmods between backends
Date: 2017-08-14 00:32:21
Message-ID: 20170814003221.ujslxthyeuukwfxn@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2017-08-11 20:39:13 +1200, Thomas Munro wrote:
> Please find attached a new patch series.

Review for 0001:

I think you made a few long lines even longer, like:

@@ -1106,11 +1106,11 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, pltcl_call_state *call_state,
Tcl_ListObjAppendElement(NULL, tcl_trigtup, Tcl_NewObj());
for (i = 0; i < tupdesc->natts; i++)
{
- if (tupdesc->attrs[i]->attisdropped)
+ if (TupleDescAttr(tupdesc, i)->attisdropped)
Tcl_ListObjAppendElement(NULL, tcl_trigtup, Tcl_NewObj());
else
Tcl_ListObjAppendElement(NULL, tcl_trigtup,
- Tcl_NewStringObj(utf_e2u(NameStr(tupdesc->attrs[i]->attname)), -1));
+ Tcl_NewStringObj(utf_e2u(NameStr(TupleDescAttr(tupdesc, i)->attname)), -1));

as it's not particularly pretty to access tupdesc->attrs[i] repeatedly,
it'd be good if you instead had a local variable for the individual
attribute.

Similar:
if (OidIsValid(get_base_element_type(TupleDescAttr(tupdesc, i)->atttypid)))
sv = plperl_ref_from_pg_array(attr, TupleDescAttr(tupdesc, i)->atttypid);
else if ((funcid = get_transform_fromsql(TupleDescAttr(tupdesc, i)->atttypid, current_call_data->prodesc->lang_oid, current_call_data->prodesc->trftypes)))
sv = (SV *) DatumGetPointer(OidFunctionCall1(funcid, attr));

@@ -150,7 +148,7 @@ ValuesNext(ValuesScanState *node)
*/
values[resind] = MakeExpandedObjectReadOnly(values[resind],
isnull[resind],
- att[resind]->attlen);
+ TupleDescAttr(slot->tts_tupleDescriptor, resind)->attlen);

@@ -158,9 +158,9 @@ convert_tuples_by_position(TupleDesc indesc,
* must agree.
*/
if (attrMap[i] == 0 &&
- indesc->attrs[i]->attisdropped &&
- indesc->attrs[i]->attlen == outdesc->attrs[i]->attlen &&
- indesc->attrs[i]->attalign == outdesc->attrs[i]->attalign)
+ TupleDescAttr(indesc, i)->attisdropped &&
+ TupleDescAttr(indesc, i)->attlen == TupleDescAttr(outdesc, i)->attlen &&
+ TupleDescAttr(indesc, i)->attalign == TupleDescAttr(outdesc, i)->attalign)
continue;

I think you get the drift, there's more....

Otherwise this seems fairly boring...

Review for 0002:

@@ -71,17 +71,17 @@ typedef struct tupleConstr
typedef struct tupleDesc
{
int natts; /* number of attributes in the tuple */
- Form_pg_attribute *attrs;
- /* attrs[N] is a pointer to the description of Attribute Number N+1 */
TupleConstr *constr; /* constraints, or NULL if none */
Oid tdtypeid; /* composite type ID for tuple type */
int32 tdtypmod; /* typmod for tuple type */
bool tdhasoid; /* tuple has oid attribute in its header */
int tdrefcount; /* reference count, or -1 if not counting */
+ /* attrs[N] is the description of Attribute Number N+1 */
+ FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
} *TupleDesc;

sorry if I'm beating on my hobby horse, but if we're re-ordering anyway,
can you move TupleConstr to the second-to-last? That a) seems more
consistent but b) (hobby horse, sorry) avoids unnecessary alignment
padding.

@@ -734,13 +708,13 @@ BuildDescForRelation(List *schema)
/* Override TupleDescInitEntry's settings as requested */
TupleDescInitEntryCollation(desc, attnum, attcollation);
if (entry->storage)
- desc->attrs[attnum - 1]->attstorage = entry->storage;
+ desc->attrs[attnum - 1].attstorage = entry->storage;

/* Fill in additional stuff not handled by TupleDescInitEntry */
- desc->attrs[attnum - 1]->attnotnull = entry->is_not_null;
+ desc->attrs[attnum - 1].attnotnull = entry->is_not_null;
has_not_null |= entry->is_not_null;
- desc->attrs[attnum - 1]->attislocal = entry->is_local;
- desc->attrs[attnum - 1]->attinhcount = entry->inhcount;
+ desc->attrs[attnum - 1].attislocal = entry->is_local;
+ desc->attrs[attnum - 1].attinhcount = entry->inhcount;

This'd be a lot more readable if wed just stored desc->attrs[attnum - 1]
in a local variable. Also think it'd be good if it just used
TupleDescAttr() for that. Probably as part of previous commit.

@@ -366,8 +340,8 @@ equalTupleDescs(TupleDesc tupdesc1, TupleDesc tupdesc2)

for (i = 0; i < tupdesc1->natts; i++)
{
- Form_pg_attribute attr1 = tupdesc1->attrs[i];
- Form_pg_attribute attr2 = tupdesc2->attrs[i];
+ Form_pg_attribute attr1 = &tupdesc1->attrs[i];
+ Form_pg_attribute attr2 = &tupdesc2->attrs[i];

I'd convert all these as part of the previous commit.

@@ -99,12 +72,9 @@ CreateTemplateTupleDesc(int natts, bool hasoid)

/*
* CreateTupleDesc
- * This function allocates a new TupleDesc pointing to a given
+ * This function allocates a new TupleDesc by copying a given
* Form_pg_attribute array.
*
- * Note: if the TupleDesc is ever freed, the Form_pg_attribute array
- * will not be freed thereby.
- *

I'm leaning towards no, but you could argue that we should just change
that remark to be about constr?

@@ -51,39 +51,12 @@ CreateTemplateTupleDesc(int natts, bool hasoid)

/*
* Allocate enough memory for the tuple descriptor, including the
- * attribute rows, and set up the attribute row pointers.
- *
- * Note: we assume that sizeof(struct tupleDesc) is a multiple of the
- * struct pointer alignment requirement, and hence we don't need to insert
- * alignment padding between the struct and the array of attribute row
- * pointers.
- *
- * Note: Only the fixed part of pg_attribute rows is included in tuple
- * descriptors, so we only need ATTRIBUTE_FIXED_PART_SIZE space per attr.
- * That might need alignment padding, however.
+ * attribute rows.
*/
- attroffset = sizeof(struct tupleDesc) + natts * sizeof(Form_pg_attribute);
- attroffset = MAXALIGN(attroffset);
- stg = palloc(attroffset + natts * MAXALIGN(ATTRIBUTE_FIXED_PART_SIZE));
+ attroffset = offsetof(struct tupleDesc, attrs);
+ stg = palloc0(attroffset + natts * sizeof(FormData_pg_attribute));
desc = (TupleDesc) stg;

note that attroffset isn't used anymore after this...

We have two mildly different places allocating a tupledesc struct for a
number of elements. Seems sense to put them into an AllocTupleDesc()?

Review of 0003:

I'm not doing a too detailed review, given I think there's some changes
in the pipeline.

@@ -0,0 +1,124 @@
+/*-------------------------------------------------------------------------
+ *
+ * ds_hash_table.h
+ * Concurrent hash tables backed by dynamic shared memory areas.

...

+/*
+ * The opaque type representing a hash table. While the struct is defined
+ * before, client code should consider it to be be an opaque and deal only in
+ * pointers to it.
+ */
+struct dht_hash_table;
+typedef struct dht_hash_table dht_hash_table;

"defined before"?

+/*
+ * The opaque type used for iterator state. While the struct is actually
+ * defined below so it can be used on the stack, client code should deal only
+ * in pointers to it rather than accessing its members.
+ */
+struct dht_iterator;
+typedef struct dht_iterator dht_iterator;

s/used/allocated/?

+
+/*
+ * The set of parameters needed to create or attach to a hash table. The
+ * members tranche_id and tranche_name do not need to be initialized when
+ * attaching to an existing hash table. The functions do need to be supplied
+ * even when attaching because we can't safely share function pointers between
+ * backends in general.
+ */
+typedef struct
+{
+ size_t key_size; /* Size of the key (initial bytes of entry) */
+ size_t entry_size; /* Total size of entry */
+ dht_compare_function compare_function; /* Compare function */
+ dht_hash_function hash_function; /* Hash function */
+ int tranche_id; /* The tranche ID to use for locks. */
+} dht_parameters;

Wonder if it'd make sense to say that key/entry sizes to be only
minimums? That means we could increase them to be the proper aligned
size?

/*
* Unlock an entry which was locked by dht_find or dht_find_or_insert.
*/
void
dht_release(dht_hash_table *hash_table, void *entry)

/*
* Release the most recently obtained lock. This can optionally be called in
* between calls to dht_iterate_next to allow other processes to access the
* same partition of the hash table.
*/
void
dht_iterate_release_lock(dht_iterator *iterator)

I'd add lock to the first too.

FWIW, I'd be perfectly fine with abbreviating iterate/iterator to "iter"
or something. We already have that elsewhere and it's pretty clear.

+/*
+ * Print out debugging information about the internal state of the hash table.
+ * No locks must be held by the caller.
+ */
+void

Should specify where the information is printed.

+ * scan. (We don't actually expect them to have more than 1 item unless
+ * the hash function is of low quality.)

That, uh, seems like an hasty remark. Even with a good hash function
you're going to get collisions occasionally.

Review of 0004:

Ignoring aspects related to REC_HASH_KEYS and related discussion, since
we're already discussing that in another email.

+static int32
+find_or_allocate_shared_record_typmod(TupleDesc tupdesc)
+{

+ /*
+ * While we still hold the atts_index entry locked, add this to
+ * typmod_index. That's important because we don't want anyone to be able
+ * to find a typmod via the former that can't yet be looked up in the
+ * latter.
+ */
+ PG_TRY();
+ {
+ typmod_index_entry =
+ dht_find_or_insert(CurrentSharedRecordTypmodRegistry.typmod_index,
+ &typmod, &found);
+ if (found)
+ elog(ERROR, "cannot create duplicate shared record typmod");
+ }
+ PG_CATCH();
+ {
+ /*
+ * If we failed to allocate or elog()ed, we have to be careful not to
+ * leak the shared memory. Note that we might have created a new
+ * atts_index entry above, but we haven't put anything in it yet.
+ */
+ dsa_free(CurrentSharedRecordTypmodRegistry.area, shared_dp);
+ PG_RE_THROW();
+ }

Not entirely related, but I do wonder if we don't need abetter solution
to this. Something like dsa pointers that register appropriate memory
context callbacks to get deleted in case of errors?

Codewise this seems ok aside from the already discussed issues.

But architecturally I'm still not sure I quite like the a bit ad-hoc
manner session state is defined here. I think we much more should go
towards a PGPROC like PGSESSION array, that PGPROCs reference. That'd
also be preallocated in "normal" shmem. From there things like the
handle for a dht typmod table could be referenced. I think we should
slowly go towards a world where session state isn't in a lot of file
local static variables. I don't know if this is the right moment to
start doing so, but I think it's quite soon.

Reviewing 0005:

Yay!

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2017-08-14 00:35:42 PATCH : Generational memory allocator (was PATCH: two slab-like memory allocators)
Previous Message Tomas Vondra 2017-08-13 22:48:27 PATCH: multivariate histograms and MCV lists