Quick Links

Re: ResourceOwner refactoring

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Aleksander Alekseev <aleksander(at)timescale(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject:	Re: ResourceOwner refactoring
Date:	2023-07-10 19:14:46
Message-ID:	20230710191446.pn5ltm73t2i3xtxz@awork3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2023-05-25 20:14:23 +0300, Heikki Linnakangas wrote:
> @@ -2491,9 +2495,6 @@ BufferSync(int flags)
> int mask = BM_DIRTY;
> WritebackContext wb_context;
>
> - /* Make sure we can handle the pin inside SyncOneBuffer */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
> /*
> * Unless this is a shutdown checkpoint or we have been explicitly told,
> * we write only permanent, dirty buffers. But at shutdown or end of
> @@ -2970,9 +2971,6 @@ BgBufferSync(WritebackContext *wb_context)
> * requirements, or hit the bgwriter_lru_maxpages limit.
> */
>
> - /* Make sure we can handle the pin inside SyncOneBuffer */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
> num_to_scan = bufs_to_lap;
> num_written = 0;
> reusable_buffers = reusable_buffers_est;
> @@ -3054,8 +3052,6 @@ BgBufferSync(WritebackContext *wb_context)
> *
> * (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
> * after locking it, but we don't care all that much.)
> - *
> - * Note: caller must have done ResourceOwnerEnlargeBuffers.
> */
> static int
> SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)

> @@ -3065,7 +3061,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
> uint32 buf_state;
> BufferTag tag;
>
> + /* Make sure we can handle the pin */
> ReservePrivateRefCountEntry();
> + ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
>
> /*
> * Check whether buffer needs writing.
> @@ -4110,9 +4108,6 @@ FlushRelationBuffers(Relation rel)
> return;
> }
>
> - /* Make sure we can handle the pin inside the loop */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
> for (i = 0; i < NBuffers; i++)
> {
> uint32 buf_state;
> @@ -4126,7 +4121,9 @@ FlushRelationBuffers(Relation rel)
> if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
> continue;
>
> + /* Make sure we can handle the pin */
> ReservePrivateRefCountEntry();
> + ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
>
> buf_state = LockBufHdr(bufHdr);
> if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
> @@ -4183,9 +4180,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
> if (use_bsearch)
> pg_qsort(srels, nrels, sizeof(SMgrSortArray), rlocator_comparator);
>
> - /* Make sure we can handle the pin inside the loop */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
> for (i = 0; i < NBuffers; i++)
> {
> SMgrSortArray *srelent = NULL;

Hm, a tad worried that increasing the number of ResourceOwnerEnlargeBuffers()
that drastically could have a visible overhead.

> +static ResourceOwnerFuncs tupdesc_resowner_funcs =
> +{
> + .name = "tupdesc reference",
> + .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
> + .release_priority = RELEASE_PRIO_TUPDESC_REFS,
> + .ReleaseResource = ResOwnerReleaseTupleDesc,
> + .PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning
> +};

I think these should all be marked const, there's no need to have them be in a
mutable section of the binary. Of course that will require adjusting a bunch
of the signatures too, but that seems fine.

> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -5171,9 +5171,24 @@ AbortSubTransaction(void)
> ResourceOwnerRelease(s->curTransactionOwner,
> RESOURCE_RELEASE_BEFORE_LOCKS,
> false, false);
> +
> AtEOSubXact_RelationCache(false, s->subTransactionId,
> s->parent->subTransactionId);
> +
> +
> + /*
> + * AtEOSubXact_Inval sometimes needs to temporarily
> + * bump the refcount on the relcache entries that it processes.
> + * We cannot use the subtransaction's resource owner anymore,
> + * because we've already started releasing it. But we can use
> + * the parent resource owner.
> + */
> + CurrentResourceOwner = s->parent->curTransactionOwner;
> +
> AtEOSubXact_Inval(false);
> +
> + CurrentResourceOwner = s->curTransactionOwner;
> +
> ResourceOwnerRelease(s->curTransactionOwner,
> RESOURCE_RELEASE_LOCKS,
> false, false);

I might be a bit slow this morning, but why did this need to change as part of
this?

> @@ -5182,8 +5221,8 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
> * If I/O was in progress, we always set BM_IO_ERROR, even though it's
> * possible the error condition wasn't related to the I/O.
> */
> -void
> -AbortBufferIO(Buffer buffer)
> +static void
> +AbortBufferIO(Buffer buffer, bool forget_owner)

What is the forget_owner argument for? It's always false, afaics?

> +/* Convenience wrappers over ResourceOwnerRemember/Forget */
> +#define ResourceOwnerRememberCatCacheRef(owner, tuple) \
> + ResourceOwnerRemember(owner, PointerGetDatum(tuple), &catcache_resowner_funcs)
> +#define ResourceOwnerForgetCatCacheRef(owner, tuple) \
> + ResourceOwnerForget(owner, PointerGetDatum(tuple), &catcache_resowner_funcs)
> +#define ResourceOwnerRememberCatCacheListRef(owner, list) \
> + ResourceOwnerRemember(owner, PointerGetDatum(list), &catlistref_resowner_funcs)
> +#define ResourceOwnerForgetCatCacheListRef(owner, list) \
> + ResourceOwnerForget(owner, PointerGetDatum(list), &catlistref_resowner_funcs)

Not specific to this type of resource owner, but I wonder if it'd not be
better to use inline functions for these - the PointerGetDatum() cast removes
nearly all type safety.

> +Releasing
> +---------
> +
> +Releasing the resources of a ResourceOwner happens in three phases:
> +
> +1. "Before-locks" resources
> +
> +2. Locks
> +
> +3. "After-locks" resources

Given that we now have priorites, I wonder if we shouldn't merge the phase and
the priorities? We have code like this:

ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_BEFORE_LOCKS,
true, true);
...
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_LOCKS,
true, true);
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_AFTER_LOCKS,
true, true);

Now, admittedly locks currently are "special" anyway, but we have a plan to
get rid of that. If we did, then we could do the last two as one as
ResourceOwnerRelease(from = RELEASE_PRIO_LOCKS, to: RELEASE_PRIO_LAST, ...)
or such.

> +Normally, you are expected to call ResourceOwnerForget on every resource so
> +that at commit, the ResourceOwner is empty (locks are an exception). If there
> +are any resources still held at commit, ResourceOwnerRelease will call the
> +PrintLeakWarning callback on each such resource. At abort, however, we truly
> +rely on the ResourceOwner mechanism and it is normal that there are resources
> +to be released.

I am not sure it's a good idea to encode that all kinds of resources ought to
have been released prior on commit. I can see that not always making sense -
in fact we already don't do it for locks, no? Perhaps just add a "commonly"?

> +typedef struct ResourceElem
> {
> ...
> + Datum item;
> + ResourceOwnerFuncs *kind; /* NULL indicates a free hash table slot */
> +} ResourceElem;

> /*
> - * Initially allocated size of a ResourceArray. Must be power of two since
> - * we'll use (arraysize - 1) as mask for hashing.
> + * Size of the fixed-size array to hold most-recently remembered resources.
> */
> -#define RESARRAY_INIT_SIZE 16
> +#define RESOWNER_ARRAY_SIZE 32

That's 512 bytes, pretty large to be searched sequentially. It's somewhat sad
that the array needs to include 8 byte of ResourceOwnerFuncs...

> + bool releasing; /* has ResourceOwnerRelease been called? */
> +
> + /*
> + * The fixed-size array for recent resources.
> + *
> + * If 'releasing' is set, the contents are sorted by release priority.
> + */
> + uint32 narr; /* how many items are stored in the array */
> + ResourceElem arr[RESOWNER_ARRAY_SIZE];
> +
> + /*
> + * The hash table. Uses open-addressing. 'nhash' is the number of items
> + * present; if it would exceed 'grow_at', we enlarge it and re-hash.
> + * 'grow_at' should be rather less than 'capacity' so that we don't waste
> + * too much time searching for empty slots.
> + *
> + * If 'releasing' is set, the contents are no longer hashed, but sorted by
> + * release priority. The first 'nhash' elements are occupied, the rest
> + * are empty.
> + */
> + ResourceElem *hash;
> + uint32 nhash; /* how many items are stored in the hash */
> + uint32 capacity; /* allocated length of hash[] */
> + uint32 grow_at; /* grow hash when reach this */
> +
> + /* The local locks cache. */
> + uint32 nlocks; /* number of owned locks */
> LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */

Due to 'narr' being before the large 'arr', 'nhash' is always on a separate
cacheline. I.e. the number of cachelines accessed in happy paths is higher
than necessary, also relevant because there's more dependencies on narr, nhash
than on the contents of those.

I'd probably order it so that both of the large elements (arr, locks) are at
the end.

It's hard to know with changes this small, but it does seem to yield a small
and repeatable performance benefit (pipelined pgbench -S).

> -} ResourceOwnerData;
> +} ResourceOwnerData;

That one pgindent will likely undo again ;)

> +/*
> + * Forget that an object is owned by a ResourceOwner
> + *
> + * Note: if same resource ID is associated with the ResourceOwner more than
> + * once, one instance is removed.
> + */
> +void
> +ResourceOwnerForget(ResourceOwner owner, Datum value, ResourceOwnerFuncs *kind)
> +{
> +#ifdef RESOWNER_TRACE
> + elog(LOG, "FORGET %d: owner %p value " UINT64_FORMAT ", kind: %s",
> + resowner_trace_counter++, owner, DatumGetUInt64(value), kind->name);
> +#endif
> +
> + /*
> + * Mustn't call this after we have already started releasing resources.
> + * (Release callback functions are not allowed to release additional
> + * resources.)
> + */
> + if (owner->releasing)
> + elog(ERROR, "ResourceOwnerForget called for %s after release started", kind->name);
> +
> + /* Search through all items in the array first. */
> + for (uint32 i = 0; i < owner->narr; i++)
> + {

Hm, I think we oftend up releasing resources in a lifo order. Would it
possibly make sense to search in reverse order?

Separately, I wonder if it'd be worth to check if there are any other matches
when assertions are enabled.

> ResourceOwnerRelease(ResourceOwner owner,
> @@ -492,6 +631,15 @@ ResourceOwnerRelease(ResourceOwner owner,
> {
> /* There's not currently any setup needed before recursing */
> ResourceOwnerReleaseInternal(owner, phase, isCommit, isTopLevel);
> +
> +#ifdef RESOWNER_STATS
> + if (isTopLevel)
> + {
> + elog(LOG, "RESOWNER STATS: lookups: array %d, hash %d", narray_lookups, nhash_lookups);
> + narray_lookups = 0;
> + nhash_lookups = 0;
> + }
> +#endif
> }

Why do we have ResourceOwnerRelease() vs ResourceOwnerReleaseInternal()? Looks
like that's older than your patch, but still confused.

> + /* Let add-on modules get a chance too */
> + for (item = ResourceRelease_callbacks; item; item = next)
> + {
> + /* allow callbacks to unregister themselves when called */
> + next = item->next;
> + item->callback(phase, isCommit, isTopLevel, item->arg);
> + }

I wonder if it's really a good idea to continue having this API. What you are
allowed to do in a resowner changed

> +/*
> + * ResourceOwnerReleaseAllOfKind
> + * Release all resources of a certain type held by this owner.
> + */
> +void
> +ResourceOwnerReleaseAllOfKind(ResourceOwner owner, ResourceOwnerFuncs *kind)
> +{
> + /* Mustn't call this after we have already started releasing resources. */
> + if (owner->releasing)
> + elog(ERROR, "ResourceOwnerForget called for %s after release started", kind->name);
>
> - /* Ditto for plancache references */
> - while (ResourceArrayGetAny(&(owner->planrefarr), &foundres))
> + /*
> + * Temporarily set 'releasing', to prevent calls to ResourceOwnerRemember
> + * while we're scanning the owner. Enlarging the hash would cause us to
> + * lose track of the point we're scanning.
> + */
> + owner->releasing = true;

Is it a problem than a possible error would lead to ->releasing = true to be
"leaked"? I think it might be, because we haven't called ResourceOwnerSort(),
which means we'd not process resources in the right order during abort
processing.

> +/*
> + * Hash function for value+kind combination.
> + */
> static inline uint32
> hash_resource_elem(Datum value, ResourceOwnerFuncs *kind)
> {
> - Datum data[2];
> -
> - data[0] = value;
> - data[1] = PointerGetDatum(kind);
> -
> - return hash_bytes((unsigned char *) &data, 2 * SIZEOF_DATUM);
> + /*
> + * Most resource kinds store a pointer in 'value', and pointers are unique
> + * all on their own. But some resources store plain integers (Files and
> + * Buffers as of this writing), so we want to incorporate the 'kind' in
> + * the hash too, otherwise those resources will collide a lot. But
> + * because there are only a few resource kinds like that - and only a few
> + * resource kinds to begin with - we don't need to work too hard to mix
> + * 'kind' into the hash. Just add it with hash_combine(), it perturbs the
> + * result enough for our purposes.
> + */
> +#if SIZEOF_DATUM == 8
> + return hash_combine64(murmurhash64((uint64) value), (uint64) kind);
> +#else
> + return hash_combine(murmurhash32((uint32) value), (uint32) kind);
> +#endif
> }

Why are we using a 64bit hash to then just throw out the high bits?

Greetings,

Andres Freund

In response to

Re: ResourceOwner refactoring at 2023-05-25 17:14:23 from Heikki Linnakangas

Responses

Re: ResourceOwner refactoring at 2023-07-10 19:40:00 from Andres Freund
Re: ResourceOwner refactoring at 2023-10-25 12:43:36 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2023-07-10 19:40:00	Re: ResourceOwner refactoring
Previous Message	Sergei Kornilov	2023-07-10 18:39:41	Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value