Re: Cache relation sizes?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Cache relation sizes?
Date: 2020-12-30 04:35:36
Message-ID: 20201230043536.fsqngrit2kl2mlue@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-11-19 18:01:14 +1300, Thomas Munro wrote:
> From ac3c61926bf947a3288724bd02cf8439ff5c14bc Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> Date: Fri, 13 Nov 2020 14:38:41 +1300
> Subject: [PATCH v2 1/2] WIP: Track relation sizes in shared memory.
>
> Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC
> smgr_shared_relation controls how many can exist at once, and they are
> evicted as required.

As noted over IM, I think it'd be good if we could avoid the need for a
new GUC here and instead derive it from other settings. E.g.
max_files_per_process * MaxBackends
or maybe
max_files_per_process * log(MaxBackends)
(on the basis that it's unlikely that each backend uses completely
separate files).

Given the size of the structs it seems that the memory overhead of this
should be acceptable?

> XXX smgrimmedsync() is maybe too blunt an instrument?
>
> XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
> some new interface, but it doesn't actually know if it's cleaning a fork
> that extended a relation...
>
> XXX perhaps bgwriter should try to clean them?
>

As I suggested on the same call, I think it might make sense to
integrate this with the checkpointer sync queue. Have checkpointer try
to keep a certain fraction of the entries clean (or even make them
unused?), by processing all the sync requests for that
relfilenode. That'd allow us to a) sync only the segments we actually
need to sync and b) sync those segments again at the end of the
checkpoint.

Another use-case I have is that I'd like to make file extension better
than it currently is. There are two main challenges:

- Having bulk extension add pages to the FSM isn't actually a good
solution, as that has a high likelihood of multiple backends ending up
trying to insert into the same page. If we had, in SMgrSharedRelation,
a 'physical' relation size and a 'logical' relation size, we could do
better, and hand out pre-allocated pages to exactly one backend.

- Doing bulk extension only when there's no other space left leads to
large pile-ups of processes waiting on the extension lock -> a large
latency increase. Instead we should opportunistically start to extend
when we've used up most of the bulk-allocated space.
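
A sketch of the first idea, with hypothetical names (SharedRelSize,
claim_new_page) and C11 atomics standing in for pg_atomic_*: a backend
claims exactly one fresh page with a single fetch-add on the logical size,
so two backends can never be handed the same page.

```c
#include <stdatomic.h>

/* Illustrative only: a shared entry tracking preallocated "physical"
 * pages versus "logical" pages already handed out to backends. */
typedef struct
{
	atomic_uint	logical_nblocks;	/* pages handed out to backends */
	atomic_uint	physical_nblocks;	/* pages already allocated on disk */
} SharedRelSize;

/* Returns a claimed block number, or (unsigned) -1 if the preallocated
 * space is exhausted and the caller must bulk-extend first. */
static unsigned
claim_new_page(SharedRelSize *sr)
{
	unsigned	block = atomic_fetch_add(&sr->logical_nblocks, 1);

	if (block >= atomic_load(&sr->physical_nblocks))
		return (unsigned) -1;
	return block;
}
```

The opportunistic-extension part would hang off the same comparison:
start extending in the background once logical_nblocks approaches
physical_nblocks, rather than when they collide.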

> +/*
> + * An entry in the hash table that allows us to look up objects in the
> + * SMgrSharedRelation pool by rnode (+ backend).
> + */
> +typedef struct SMgrSharedRelationMapping
> +{
> + RelFileNodeBackend rnode;
> + int index;
> +} SMgrSharedRelationMapping;

Maybe add a note that index points into
SMgrSharedRelationPool->objects[]? And why you chose to not store
SMgrSharedRelation directly in the hashtable (makes sense to me, still
worth commenting on imo).

> +/* Flags. */
> +#define SR_LOCKED 0x01
> +#define SR_VALID 0x02
> +
> +/* Each forknum gets its own dirty, syncing and just dirtied bits. */
> +#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
> +#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
> +#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))

Hm - do we really want all that stored in a single flag value?
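
For reference, restating the patch's layout to show the bit budget (with
MAX_FORKNUM = 3, the init fork): each of dirty/syncing/just-dirtied takes 4
bits, plus LOCKED and VALID, so 14 of 32 bits are used. A single atomic
word does fit; the question is whether mixing a lock bit with state bits
is wise.

```c
#define MAX_FORKNUM 3

#define SR_LOCKED 0x01
#define SR_VALID  0x02

/* bits 2..5: dirty, bits 6..9: syncing, bits 10..13: just-dirtied */
#define SR_DIRTY(f)        (0x04 << ((f) + (MAX_FORKNUM + 1) * 0))
#define SR_SYNCING(f)      (0x04 << ((f) + (MAX_FORKNUM + 1) * 1))
#define SR_JUST_DIRTIED(f) (0x04 << ((f) + (MAX_FORKNUM + 1) * 2))
```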

> +/*
> + * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
> + * only a few instructions. The return value is the current set of flags,
> + * which may be modified and then passed to smgr_unlock_sr() to be applied atomically
> + * when the lock is released.
> + */
> +static uint32
> +smgr_lock_sr(SMgrSharedRelation *sr)
> +{
> +
> + for (;;)
> + {
> + uint32 old_flags = pg_atomic_read_u32(&sr->flags);
> + uint32 flags;
> +
> + if (!(old_flags & SR_LOCKED))
> + {
> + flags = old_flags | SR_LOCKED;
> + if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
> + return flags;
> + }
> + }
> + return 0; /* unreachable */
> +}

Since this is a spinlock, you should use the spin delay infrastructure
(c.f. LWLockWaitListLock()). Otherwise it'll react terribly if there
is ever contention.
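
Roughly the shape meant, as a standalone sketch (C11 atomics and
sched_yield() standing in for pg_atomic_* and PostgreSQL's
SpinDelayStatus machinery; the threshold 100 is arbitrary):

```c
#include <stdatomic.h>
#include <sched.h>

#define SR_LOCKED 0x01

/* CAS loop that backs off under contention instead of hammering the
 * cache line.  Returns the flags with SR_LOCKED set, as the patch's
 * smgr_lock_sr() does. */
static unsigned
lock_flags(atomic_uint *flags)
{
	int		spins = 0;

	for (;;)
	{
		unsigned	old = atomic_load(flags);

		if (!(old & SR_LOCKED) &&
			atomic_compare_exchange_weak(flags, &old, old | SR_LOCKED))
			return old | SR_LOCKED;

		if (++spins > 100)		/* crude stand-in for perform_spin_delay() */
		{
			sched_yield();
			spins = 0;
		}
	}
}
```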

What's the reason not to use a per-entry lwlock instead?

Naming: I find it weird to have smgr_ as a prefix and _sr as a
postfix. I'd go for smgr_sr_$op.

> +/*
> + * Unlock a SMgrSharedRelation, atomically updating its flags at the same
> + * time.
> + */
> +static void
> +smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
> +{
> + pg_write_barrier();
> + pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
> +}

This assumes there never are atomic modifications of flags without
holding the spinlock (or a cmpxchg loop testing whether somebody is
holding the lock). Is that right / good?

> +/*
> + * Allocate a new invalid SMgrSharedRelation, and return it locked.
> + *
> + * The replacement algorithm is a simple FIFO design with no second chance for
> + * now.
> + */
> +static SMgrSharedRelation *
> +smgr_alloc_sr(void)
> +{
> + SMgrSharedRelationMapping *mapping;
> + SMgrSharedRelation *sr;
> + uint32 index;
> + LWLock *mapping_lock;
> + uint32 flags;
> + RelFileNodeBackend rnode;
> + uint32 hash;
> +
> + retry:
> + /* Lock the next one in clock-hand order. */
> + index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
> + sr = &sr_pool->objects[index];
> + flags = smgr_lock_sr(sr);

Modulo is pretty expensive - I'd force smgr_shared_relations to be a
power-of-two and then do a & smgr_shared_relations_mask, with the mask
initialized to (smgr_shared_relations - 1).
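
Concretely (next_pot is a made-up helper name): when size is a power of
two, (x % size) == (x & (size - 1)), so the division becomes a single AND.

```c
/* Round n up to the next power of two, so (n - 1) is a valid mask. */
static unsigned
next_pot(unsigned n)
{
	unsigned	p = 1;

	while (p < n)
		p <<= 1;
	return p;
}
```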

> + /* If it's unused, can return it, still locked, immediately. */
> + if (!(flags & SR_VALID))
> + return sr;
> +
> + /*
> + * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
> + * locks, but we need to do it in that order, so we'll unlock the SR
> + * first.
> + */
> + rnode = sr->rnode;
> + smgr_unlock_sr(sr, flags);
> +
> + hash = get_hash_value(sr_mapping_table, &rnode);
> + mapping_lock = SR_PARTITION_LOCK(hash);
> +
> + LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
> + mapping = hash_search_with_hash_value(sr_mapping_table,
> + &rnode,
> + hash,
> + HASH_FIND,
> + NULL);
> + if (!mapping || mapping->index != index)
> + {
> + /* Too slow, it's gone or now points somewhere else. Go around. */
> + LWLockRelease(mapping_lock);
> + goto retry;
> + }
> +
> + /* We will lock the SR for just a few instructions. */
> + flags = smgr_lock_sr(sr);
> + Assert(flags & SR_VALID);

So the replacement here is purely based on when the relation was
accessed first, not when it was accessed last. Probably good enough, but
seems worth a comment.

> 3. The nblocks value must be free enough for scans, extension,
> truncation and dropping buffers, because those operations all
> executed memory barriers when they acquired a snapshot to scan (which
> doesn't need to see blocks added after that) or an exclusive heavyweight
> lock to extend, truncate or drop. XXX right?

"free enough"?

I am not sure this holds for non-mvcc scans?

> @@ -699,7 +748,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
> bool
> smgrexists(SMgrRelation reln, ForkNumber forknum)
> {
> - if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
> + if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
> return true;

ISTM that all these places referencing _fast are leaking an
implementation detail (but not very far) - seems like that should be an
implementation detail of smgrnblocks(), which can then call out to
_fast/_shared or such.
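
I.e. something of this shape (a sketch with made-up names; the function
pointer stands in for the _shared/lseek fallback):

```c
typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Callers see only the wrapper; whether the answer came from a cached
 * value or the authoritative path is hidden inside. */
static BlockNumber
smgrnblocks_wrapped(BlockNumber fast_value, BlockNumber (*slow_path)(void))
{
	if (fast_value != InvalidBlockNumber)
		return fast_value;		/* fast path: cached value was usable */
	return slow_path();			/* fall back to the authoritative path */
}

/* Demo slow path: pretend the kernel says the relation has 42 blocks. */
static BlockNumber
demo_slow_path(void)
{
	return 42;
}
```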

Greetings,

Andres Freund
