Re: shared-memory based stats collector

From: Andres Freund <andres(at)anarazel(dot)de>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, gkokolatos(at)protonmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2021-03-11 04:26:56
Message-ID: 20210311042656.ofxeibs7z2btdcgx@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-03-10 17:51:37 +0900, Kyotaro Horiguchi wrote:
> From ed2fb2fca47fccbf9af1538688aab8334cf6470c Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horikyoga(dot)ntt(at)gmail(dot)com>
> Date: Fri, 13 Mar 2020 16:58:03 +0900
> Subject: [PATCH v52 1/7] sequential scan for dshash
>
> Dshash did not allow scan the all entries sequentially. This adds the
> functionality. The interface is similar but a bit different both from
> that of dynahash and simple dshash search functions. One of the most
> significant differences is the sequential scan interface of dshash
> always needs a call to dshash_seq_term when scan ends. Another is
> locking. Dshash holds partition lock when returning an entry,
> dshash_seq_next() also holds lock when returning an entry but callers
> shouldn't release it, since the lock is essential to continue a
> scan. The seqscan interface allows entry deletion while a scan. The
> in-scan deletion should be performed by dshash_delete_current().

*while a scan is in progress

> +void *
> +dshash_seq_next(dshash_seq_status *status)
> +{
> + dsa_pointer next_item_pointer;
> +
> + if (status->curitem == NULL)
> + {
> + int partition;
> +
> + Assert(status->curbucket == 0);
> + Assert(!status->hash_table->find_locked);
> +
> + /* first shot. grab the first item. */
> + partition =
> + PARTITION_FOR_BUCKET_INDEX(status->curbucket,
> + status->hash_table->size_log2);
> + LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
> + status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
> + status->curpartition = partition;

What does "first shot" mean here?

> +/*
> + * Terminates the seqscan and release all locks.
> + *
> + * Should be always called when finishing or exiting a seqscan.
> + */
> +void
> +dshash_seq_term(dshash_seq_status *status)
> +{
> + status->hash_table->find_locked = false;
> + status->hash_table->find_exclusively_locked = false;
> +
> + if (status->curpartition >= 0)
> + LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
> +}

I think it'd be good if you added an assertion preventing this from
being called twice.

status->curpartition should definitely be reset to something < 0 after
releasing the lock, to avoid that happening again in case of such a bug.

> +/* Get the current entry while a seq scan. */
> +void *
> +dshash_get_current(dshash_seq_status *status)
> +{
> + return ENTRY_FROM_ITEM(status->curitem);
> +}

What is this used for? It'd probably be good to assert that there's a
scan in progress too, and that locks are held - otherwise this isn't
safe, right?

> +void *
> +dshash_find_extended(dshash_table *hash_table, const void *key,
> + bool exclusive, bool nowait, bool insert, bool *found)
> +{
> + dshash_hash hash = hash_key(hash_table, key);
> + size_t partidx = PARTITION_FOR_HASH(hash);
> + dshash_partition *partition = &hash_table->control->partitions[partidx];
> + LWLockMode lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
> dshash_table_item *item;
> - hash = hash_key(hash_table, key);
> - partition_index = PARTITION_FOR_HASH(hash);
> - partition = &hash_table->control->partitions[partition_index];
> -
> - Assert(hash_table->control->magic == DSHASH_MAGIC);
> - Assert(!hash_table->find_locked);
> + /* must be exclusive when insert allowed */
> + Assert(!insert || (exclusive && found != NULL));
>
> restart:
> - LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
> - LW_EXCLUSIVE);
> + if (!nowait)
> + LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
> + else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
> + lockmode))
> + return NULL;

I think this code (and probably also some in the previous patch) would
be more readable if you introduced a local variable to store
PARTITION_LOCK(hash_table, partidx) in. There's four repetitions of
this...

> ensure_valid_bucket_pointers(hash_table);
>
> /* Search the active bucket. */
> item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
>
> if (item)
> - *found = true;
> + {
> + if (found)
> + *found = true;
> + }
> else
> {
> - *found = false;
> + if (found)
> + *found = false;
> +
> + if (!insert)
> + {
> + /* The caller didn't told to add a new entry. */

s/told/tell us/

This nested ifs here make me think that the interface is too
complicated...

> +typedef struct StatsShmemStruct
> +{
> + dsa_handle stats_dsa_handle; /* handle for stats data area */
> + dshash_table_handle hash_handle; /* shared dbstat hash */
> + int refcount; /* # of processes that is attaching the shared
> + * stats memory */
> + /* Global stats structs */
> + PgStat_Archiver archiver_stats;
> + pg_atomic_uint32 archiver_changecount;
> + PgStat_BgWriter bgwriter_stats;
> + pg_atomic_uint32 bgwriter_changecount;
> + PgStat_CheckPointer checkpointer_stats;
> + pg_atomic_uint32 checkpointer_changecount;
> + PgStat_Wal wal_stats;
> + LWLock wal_stats_lock;
> + PgStatSharedSLRUStats slru_stats;
> + pg_atomic_uint32 slru_changecount;
> + pg_atomic_uint64 stats_timestamp;

Looks like you're intending to use something similar to
PGSTAT_BEGIN_WRITE_ACTIVITY etc. But you're not using the same set of
macros, and I didn't see any comments explaining the exact locking
protocol.

I did see that you do atomic fetch_add below. That should never be
necessary though - there's only ever exactly one writer for archiver,
bgwriter, checkpointer etc.

> +
> +/* BgWriter global statistics counters */
> +PgStat_BgWriter BgWriterStats = {0};
> +
> +/* CheckPointer global statistics counters */
> +PgStat_CheckPointer CheckPointerStats = {0};
> +
> +/* WAL global statistics counters */
> +PgStat_Wal WalStats = {0};
> +

This makes it sound like these are actually global counters, visible to
everyone - but they're not, right?

> /*
> - * SLRU statistics counts waiting to be sent to the collector. These are
> - * stored directly in stats message format so they can be sent without needing
> - * to copy things around. We assume this variable inits to zeroes. Entries
> - * are one-to-one with slru_names[].
> + * XXXX: always try to flush WAL stats. We don't want to manipulate another
> + * counter during XLogInsert so we don't have an effecient short cut to know
> + * whether any counter gets incremented.
> */
> -static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
> +static inline bool
> +walstats_pending(void)
> +{
> + static const PgStat_Wal all_zeroes;
> +
> + return memcmp(&WalStats, &all_zeroes,
> + offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
> +}

That's a pretty expensive way to figure out whether there've been
changes, given how PgStat_Wal has grown. You say "don't want to
manipulate another during XLogInsert" - but a simple increment wouldn't
cost that much in addition to what the wal stats stuff already does.

> + * Per-object statistics are stored in a "shared stats", corresponding struct
> + * that has a header part common among all object types in DSA-allocated
> + * memory.

The part of the sentence doesn't quite seem to make sense gramattically.

> +static void
> +attach_shared_stats(void)
> +{
> + MemoryContext oldcontext;

> + /*
> + * The first attacher backend may still reading the stats file, or the
> + * last detacher may writing it. Wait for the work to finish.
> + */

I still believe this kind of approach is too complicated, and we should
simply not do any of this "first attacher" business. Instead read it in
the startup process or such.

> + StatsShmem->refcount++;
> + else
> {
> - ereport(LOG,
> - (errcode_for_socket_access(),
> - errmsg("could not set statistics collector socket to nonblocking mode: %m")));
> - goto startup_failed;
> + /* We're the first process to attach the shared stats memory */
> + Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
> +
> + /* Initialize shared memory area */
> + area = dsa_create(LWTRANCHE_STATS);
> + pgStatSharedHash = dshash_create(area, &dsh_params, 0);
> +
> + StatsShmem->stats_dsa_handle = dsa_get_handle(area);
> + StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
> + LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
> + pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
> +
> + /* Block the next attacher for a while, see the comment above. */
> + StatsShmem->attach_holdoff = true;
> +
> + StatsShmem->refcount = 1;
> }

At least the lwlock etc initialization should definitely just happen as
part of shared memory initialization, not at some point later (like most
(all?) other places using *statically sized* shared memory). The story
for dsa_create/dshash_create is different, but then just do that, not
everything else.

> +/* ----------
> + * get_stat_entry() -
> *
> - * Called from postmaster at startup or after an existing collector
> - * died. Attempt to fire up a fresh statistics collector.
> + * get shared stats entry for specified type, dbid and objid.
> + * If nowait is true, returns NULL on lock failure.
> *
> - * Returns PID of child process, or 0 if fail.
> + * If initfunc is not NULL, new entry is created if not yet and the function
> + * is called with the new base entry. If found is not NULL, it is set to true
> + * if existing entry is found or false if not.
> + * ----------
> + */
> +static PgStat_StatEntryHeader *
> +get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
> + bool *found)
> +{
> + PgStatHashEntry *shhashent;
> + PgStatLocalHashEntry *lohashent;
> + PgStat_StatEntryHeader *shheader = NULL;
> + PgStatHashKey key;
> + bool shfound;
> +
> + key.type = type;
> + key.databaseid = dbid;
> + key.objectid = objid;
> +
> + if (pgStatEntHash)
> + {
> + uint64 currage;
> +
> + /*
> + * pgStatEntHashAge increments quite slowly than the time the
> + * following loop takes so this is expected to iterate no more than
> + * twice.
> + */
> + while (unlikely
> + (pgStatEntHashAge !=
> + (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
> + {
> + pgstat_localhash_iterator i;
> +
> + /*
> + * Some entries have been dropped. Invalidate cache pointer to
> + * them.
> + */
> + pgstat_localhash_start_iterate(pgStatEntHash, &i);
> + while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
> + != NULL)
> + {
> + PgStat_StatEntryHeader *header = lohashent->body;
> +
> + if (header->dropped)
> + {
> + pgstat_localhash_delete(pgStatEntHash, key);
> +
> + if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
> + {
> + /*
> + * We're the last referrer to this entry, drop the
> + * shared entry.
> + */
> + dsa_free(area, lohashent->dsapointer);
> + }
> + }
> + }
> +
> + pgStatEntHashAge = currage;
> + }

This should be its own function.

> + shhashent = dshash_find_extended(pgStatSharedHash, &key,
> + create, nowait, create, &shfound);
> + if (shhashent)
> + {
> + if (create && !shfound)
> + {
> + /* Create new stats entry. */
> + dsa_pointer chunk = dsa_allocate0(area,
> + pgstat_sharedentsize[type]);
> +
> + shheader = dsa_get_address(area, chunk);
> + LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
> + pg_atomic_init_u32(&shheader->refcount, 0);
> +
> + /* Link the new entry from the hash entry. */
> + shhashent->body = chunk;
> + }
> + else
> + shheader = dsa_get_address(area, shhashent->body);
> +
> + /*
> + * We expose this shared entry now. You might think that the entry
> + * can be removed by a concurrent backend, but since we are creating
> + * an stats entry, the object actually exists and used in the upper
> + * layer. Such an object cannot be dropped until the first vacuum
> + * after the current transaction ends.
> + */
> + dshash_release_lock(pgStatSharedHash, shhashent);

I don't think you can safely release the lock before you incremented the
refcount? What if, once the lock is released, somebody looks up that
entry, increments the refcount, and decrements it again? It'll see a
refcount of 0 at the end and decide to free the memory. Then the code
below will access already freed / reused memory, no?

> @@ -2649,85 +3212,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
> /* ----------
> * pgstat_fetch_stat_dbentry() -
> *
> - * Support function for the SQL-callable pgstat* functions. Returns
> - * the collected statistics for one database or NULL. NULL doesn't mean
> - * that the database doesn't exist, it is just not yet known by the
> - * collector, so the caller is better off to report ZERO instead.
> + * Find database stats entry on backends in a palloc'ed memory.
> + *
> + * The returned entry is stored in static memory so the content is valid until
> + * the next call of the same function for the different database.
> * ----------
> */
> PgStat_StatDBEntry *
> pgstat_fetch_stat_dbentry(Oid dbid)
> {
> - /*
> - * If not done for this transaction, read the statistics collector stats
> - * file into some hash tables.
> - */
> - backend_read_statsfile();
> -
> - /*
> - * Lookup the requested database; return NULL if not found
> - */
> - return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
> - (void *) &dbid,
> - HASH_FIND, NULL);
> + PgStat_StatDBEntry *shent;
> +
> + /* should be called from backends */
> + Assert(IsUnderPostmaster);

Can't these be called from single user mode?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2021-03-11 04:31:20 Re: libpq debug log
Previous Message tsunakawa.takay@fujitsu.com 2021-03-11 04:12:57 RE: libpq debug log