Re: shared-memory based stats collector

From: Andres Freund <andres(at)anarazel(dot)de>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: michael(at)paquier(dot)xyz, alvherre(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, a(dot)zakirov(at)postgrespro(dot)ru, ah(at)cybertec(dot)at, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2020-03-13 03:13:24
Message-ID: 20200313031324.4al3nnwxnzl2j4sb@alap3.anarazel.de

Hi,

Thomas, could you look at the first two patches here, and my review
questions?

General comments about this series:
- A lot of the review comments feel like I've written them before, a
year or more ago. I feel this patch ought to be in a much better
state. There's a lot of IMO fairly obvious stuff here, and things that
have been mentioned multiple times previously.
- There are a *lot* of typos in here. I realize writing in a second
language is hard, but a lot of these can be found with the simplest
spellchecker. That's one thing for a patch that has just been hacked up
as a POC, but this is a multi-year thread?
- There's some odd formatting. Consider using pgindent more regularly.

More detailed comments below.

I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.

On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Date: Fri, 29 Jun 2018 16:41:04 +0900
> Subject: [PATCH 1/5] sequential scan for dshash
>
> Add sequential scan feature to dshash.

> dsa_pointer item_pointer = hash_table->buckets[i];
> @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> LW_EXCLUSIVE));
>
> delete_item(hash_table, item);
> - hash_table->find_locked = false;
> - hash_table->find_exclusively_locked = false;
> - LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +
> + /* We need to keep partition lock while sequential scan */
> + if (!hash_table->seqscan_running)
> + {
> + hash_table->find_locked = false;
> + hash_table->find_exclusively_locked = false;
> + LWLockRelease(PARTITION_LOCK(hash_table, partition));
> + }
> }

This seems like a failure-prone API.

> /*
> @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
> Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
> hash_table->find_exclusively_locked
> ? LW_EXCLUSIVE : LW_SHARED));
> + /* lock is under control of sequential scan */
> + Assert(!hash_table->seqscan_running);
>
> hash_table->find_locked = false;
> hash_table->find_exclusively_locked = false;
> @@ -592,6 +610,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
> return tag_hash(v, size);
> }
>
> +/*
> + * dshash_seq_init/_next/_term
> + * Sequentially scan trhough dshash table and return all the
> + * elements one by one, return NULL when no more.

s/trhough/through/

This uses a different comment style than the other functions in this
file. Why?

> + * dshash_seq_term should be called for incomplete scans and otherwise
> + * shoudln't. Finished scans are cleaned up automatically.

s/shoudln't/shouldn't/

I find the "cleaned up automatically" API terrible. I know you copied it
from dynahash, but I find it to be really failure-prone. dynahash isn't
an example of good postgres code - rather the opposite, I'd say. It's a
lot easier to unconditionally have a terminate call if we need that.
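
For illustration, a minimal caller-side sketch of what I mean, using the
signatures from this patch (entry_is_done() is a made-up predicate):

dshash_seq_status status;
void	   *entry;

dshash_seq_init(&status, hash_table, false /* consistent */, false /* exclusive */);
while ((entry = dshash_seq_next(&status)) != NULL)
{
	if (entry_is_done(entry))
		break;				/* early exit is fine... */
}
dshash_seq_term(&status);	/* ...because termination is always explicit */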

> + * Returned elements are locked as is the case with dshash_find. However, the
> + * caller must not release the lock.
> + *
> + * Same as dynanash, the caller may delete returned elements midst of a scan.

I think it's a bad idea to refer to dynahash here. That's just going to
get out of date. Also, code should be documented on its own.

> + * If consistent is set for dshash_seq_init, the all hash table partitions are
> + * locked in the requested mode (as determined by the exclusive flag) during
> + * the scan. Otherwise partitions are locked in one-at-a-time way during the
> + * scan.

Yet delete unconditionally retains locks?

> + */
> +void
> +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> + bool consistent, bool exclusive)
> +{

Why does this patch add the consistent mode? There are no users currently?
Without it, it's not clear that we need a separate _term function, I think?

I think we can also get rid of the dshash_delete changes, by instead
adding a dshash_delete_current(dshash_seq_status *status, void *entry) API
or such.
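
Untested sketch of what I have in mind - this assumes the usual dshash.c
internals (ITEM_FROM_ENTRY(), PARTITION_FOR_HASH(), delete_item()) and
that dshash_seq_status keeps a pointer to the table:

void
dshash_delete_current(dshash_seq_status *status, void *entry)
{
	dshash_table *hash_table = status->hash_table;
	dshash_table_item *item = ITEM_FROM_ENTRY(entry);
	size_t		partition = PARTITION_FOR_HASH(item->hash);

	Assert(hash_table->control->magic == DSHASH_MAGIC);
	Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
								LW_EXCLUSIVE));

	/* the partition lock stays held - the scan owns it */
	delete_item(hash_table, item);
}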

> @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
> extern void dshash_detach(dshash_table *hash_table);
> extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
> extern void dshash_destroy(dshash_table *hash_table);
> -
> /* Finding, creating, deleting entries. */
> extern void *dshash_find(dshash_table *hash_table,
> const void *key, bool
> exclusive);

There's a number of spurious changes like this.

> From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Date: Thu, 27 Sep 2018 11:15:19 +0900
> Subject: [PATCH 2/5] Add conditional lock feature to dshash
>
> Dshash currently waits for lock unconditionally. This commit adds new
> interfaces for dshash_find and dshash_find_or_insert. The new
> interfaces have an extra parameter "nowait" taht commands not to wait
> for lock.

s/taht/that/

There should be at least a sentence or two explaining why these are
useful.

> +/*
> + * The version of dshash_find, which is allowed to return immediately on lock
> + * failure. Lock status is set to *lock_failed in that case.
> + */

Hm. Not sure I like the *lock_acquired API.

> +void *
> +dshash_find_extended(dshash_table *hash_table, const void *key,
> + bool exclusive, bool nowait, bool *lock_acquired)
> {
> dshash_hash hash;
> size_t partition;
> dshash_table_item *item;
>
> + /*
> + * No need to return lock resut when !nowait. Otherwise the caller may
> + * omit the lock result when NULL is returned.
> + */
> + Assert(nowait || !lock_acquired);
> +
> hash = hash_key(hash_table, key);
> partition = PARTITION_FOR_HASH(hash);
>
> Assert(hash_table->control->magic == DSHASH_MAGIC);
> Assert(!hash_table->find_locked);
>
> - LWLockAcquire(PARTITION_LOCK(hash_table, partition),
> - exclusive ? LW_EXCLUSIVE : LW_SHARED);
> + if (nowait)
> + {
> + if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
> + exclusive ? LW_EXCLUSIVE : LW_SHARED))
> + {
> + if (lock_acquired)
> + *lock_acquired = false;

Why is the test for lock_acquired needed here? I don't think it's
possible to use nowait correctly without passing in lock_acquired?

I think it'd make sense to document & assert that nowait = true implies
lock_acquired being set, and nowait = false implies it not being set.

But, uh, why do we even need the lock_acquired parameter? If we couldn't
find an entry, then we should just release the lock, no?

I'm however inclined to think it's better to just have a separate
function for the nowait case, rather than an extended version supporting
both (with an internal helper doing most of the work).
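
I.e. something like the following sketch, where dshash_find_internal()
is an invented name for the common worker:

static void *dshash_find_internal(dshash_table *hash_table, const void *key,
								  bool exclusive, bool nowait);

void *
dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
{
	return dshash_find_internal(hash_table, key, exclusive, false);
}

/*
 * Returns NULL both on lock failure and on a missing entry; in either
 * case no lock is held on return.
 */
void *
dshash_find_nowait(dshash_table *hash_table, const void *key, bool exclusive)
{
	return dshash_find_internal(hash_table, key, exclusive, true);
}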

> +/*
> + * The version of dshash_find_or_insert, which is allowed to return immediately
> + * on lock failure.
> + *
> + * Notes above dshash_find_extended() regarding locking and error handling
> + * equally apply here.

They don't, there's no lock_acquired parameter.

> + */
> +void *
> +dshash_find_or_insert_extended(dshash_table *hash_table,
> + const void *key,
> + bool *found,
> + bool nowait)

I think it's absurd to have dshash_find, dshash_find_extended,
dshash_find_or_insert, dshash_find_or_insert_extended. If they're
extended they should also be able to specify whether the entry will get
created.
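
I.e. collapse them into a single extended version, along these lines
(just a sketch):

extern void *dshash_find_extended(dshash_table *hash_table,
								  const void *key,
								  bool exclusive,	/* lock mode */
								  bool nowait,		/* don't block on the partition lock */
								  bool insert,		/* create the entry if missing */
								  bool *found);		/* did the entry already exist? */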

> From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Date: Wed, 7 Nov 2018 16:53:49 +0900
> Subject: [PATCH 3/5] Make archiver process an auxiliary process
>
> This is a preliminary patch for shared-memory based stats collector.
> Archiver process must be a auxiliary process since it uses shared
> memory after stats data wes moved onto shared-memory. Make the process

s/wes/was/ s/onto/into/

> an auxiliary process in order to make it work.

>

> @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
> StartupProcessMain();
> proc_exit(1); /* should never return */
>
> + case ArchiverProcess:
> + /* don't set signals, archiver has its own agenda */
> + PgArchiverMain();
> + proc_exit(1); /* should never return */
> +
> case BgWriterProcess:
> /* don't set signals, bgwriter has its own agenda */
> BackgroundWriterMain();

I think I'd rather remove the two comments that are copied to 6 out of 8
cases - they don't add anything.

> /* ------------------------------------------------------------
> * Local functions called by archiver follow
> * ------------------------------------------------------------
> @@ -219,8 +148,8 @@ pgarch_forkexec(void)
> * The argc/argv parameters are valid only in EXEC_BACKEND case. However,
> * since we don't use 'em, it hardly matters...
> */
> -NON_EXEC_STATIC void
> -PgArchiverMain(int argc, char *argv[])
> +void
> +PgArchiverMain(void)
> {
> /*
> * Ignore all signals usually bound to some action in the postmaster,
> @@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
> static void
> pgarch_exit(SIGNAL_ARGS)
> {
> - /* SIGQUIT means curl up and die ... */
> - exit(1);
> + PG_SETMASK(&BlockSig);
> +
> + /*
> + * We DO NOT want to run proc_exit() callbacks -- we're here because
> + * shared memory may be corrupted, so we don't want to try to clean up our
> + * transaction. Just nail the windows shut and get out of town. Now that
> + * there's an atexit callback to prevent third-party code from breaking
> + * things by calling exit() directly, we have to reset the callbacks
> + * explicitly to make this work as intended.
> + */
> + on_exit_reset();
> +
> + /*
> + * Note we do exit(2) not exit(0). This is to force the postmaster into a
> + * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
> + * process. This is necessary precisely because we don't clean up our
> + * shared memory state. (The "dead man switch" mechanism in pmsignal.c
> + * should ensure the postmaster sees this as a crash, too, but no harm in
> + * being doubly sure.)
> + */
> + exit(2);
> }
>

This seems to be a copy of code & comments from other signal handlers that predates

commit 8e19a82640d3fa2350db146ec72916856dd02f0a
Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Date: 2018-08-08 19:08:10 +0300

Don't run atexit callbacks in quickdie signal handlers.

I think this just should use SignalHandlerForCrashExit().

I think we can even commit that separately - there's not really a reason
to not do that today, as far as I can tell?
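
I.e., roughly:

#include "postmaster/interrupt.h"

	/* in PgArchiverMain()'s signal setup, replacing pgarch_exit: */
	pqsignal(SIGQUIT, SignalHandlerForCrashExit);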

> /* SIGUSR1 signal handler for archiver process */

Hm - this currently doesn't set up a correct sigusr1 handler for a
shared memory backend - needs to invoke procsignal_sigusr1_handler
somewhere.

We can probably just convert to using normal latches here, and remove
the current 'wakened' logic? That'll remove the indirection via
postmaster too, which is nice.
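
Rough sketch of what the main loop could look like - reusing the
existing PGARCH_AUTOWAKE_INTERVAL and wait event; the signal handlers
would just SetLatch(MyLatch) instead of setting 'wakened':

for (;;)
{
	ResetLatch(MyLatch);

	/* react to reconfiguration / shutdown requests here */

	/* archive anything that's ready */
	pgarch_ArchiverCopyLoop();

	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 PGARCH_AUTOWAKE_INTERVAL * 1000L,
					 WAIT_EVENT_ARCHIVER_MAIN);
}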

> @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
>
> switch (backendType)
> {
> + case B_ARCHIVER:
> + backendDesc = "archiver";
> + break;

should imo include 'WAL' or such.

> From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 4/5] Shared-memory based stats collector
>
> Previously activity statistics is shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk with a certain interval
> then every backend reads them as necessary. It worked fine for
> comparatively small set of statistics but the set is under the
> pressure to growing up and the file size has reached the order of
> megabytes. To deal with larger statistics set, this patch let backends
> directly share the statistics via shared memory.

This spends a fair bit describing the old state, but very little
describing the new state.

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 0bfd6151c4..a6b0bdec12 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -53,7 +53,6 @@ postgres 15554 0.0 0.0 57536 1184 ? Ss 18:02 0:00 postgres: back
> postgres 15555 0.0 0.0 57536 916 ? Ss 18:02 0:00 postgres: checkpointer
> postgres 15556 0.0 0.0 57536 916 ? Ss 18:02 0:00 postgres: walwriter
> postgres 15557 0.0 0.0 58504 2244 ? Ss 18:02 0:00 postgres: autovacuum launcher
> -postgres 15558 0.0 0.0 17512 1068 ? Ss 18:02 0:00 postgres: stats collector
> postgres 15582 0.0 0.0 58772 3080 ? Ss 18:04 0:00 postgres: joe runbug 127.0.0.1 idle
> postgres 15606 0.0 0.0 58772 3052 ? Ss 18:07 0:00 postgres: tgl regression [local] SELECT waiting
> postgres 15610 0.0 0.0 58772 3056 ? Ss 18:07 0:00 postgres: tgl regression [local] idle in transaction
> @@ -65,9 +64,8 @@ postgres 15610 0.0 0.0 58772 3056 ? Ss 18:07 0:00 postgres: tgl
> master server process. The command arguments
> shown for it are the same ones used when it was launched. The next five
> processes are background worker processes automatically launched by the
> - master process. (The <quote>stats collector</quote> process will not be present
> - if you have set the system not to start the statistics collector; likewise
> - the <quote>autovacuum launcher</quote> process can be disabled.)
> + master process. (The <quote>autovacuum launcher</quote> process will not
> + be present if you have set the system not to start it.)
> Each of the remaining
> processes is a server process handling one client connection. Each such
> process sets its command line display in the form

There are more references to the stats collector than this... E.g. in
catalogs.sgml

<xref linkend="view-table"/> lists the system views described here.
More detailed documentation of each view follows below.
There are some additional views that provide access to the results of
the statistics collector; they are described in <xref
linkend="monitoring-stats-views-table"/>.
</para>

> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 6d1f28c327..8dcb0fb7f7 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1956,15 +1956,15 @@ do_autovacuum(void)
> ALLOCSET_DEFAULT_SIZES);
> MemoryContextSwitchTo(AutovacMemCxt);
>
> + /* Start a transaction so our commands have one to play into. */
> + StartTransactionCommand();
> +
> /*
> * may be NULL if we couldn't find an entry (only happens if we are
> * forcing a vacuum for anti-wrap purposes).
> */
> dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
>
> - /* Start a transaction so our commands have one to play into. */
> - StartTransactionCommand();
> -
> /*
> * Clean up any dead statistics collector entries for this DB. We always
> * want to do this exactly once per DB-processing cycle, even if we find
> @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
> if (isshared)
> {
> if (PointerIsValid(shared))
> - tabentry = hash_search(shared->tables, &relid,
> - HASH_FIND, NULL);
> + tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
> }
> else if (PointerIsValid(dbentry))
> - tabentry = hash_search(dbentry->tables, &relid,
> - HASH_FIND, NULL);
> + tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
>
> return tabentry;
> }

Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside
the stats subsystem there is exactly one caller of the non-extended
version, as far as I can see. That's index_concurrently_swap() - and imo
that's code that should live in the stats subsystem, rather than
open-coded in index.c.

> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index ca5c6376e5..1ffe073a1f 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -1,15 +1,23 @@
> /* ----------
> * pgstat.c
> *
> - * All the statistics collector stuff hacked up in one big, ugly file.
> + * Statistics collector facility.
> *
> - * TODO: - Separate collector, postmaster and backend stuff
> - * into different files.
> + * Collects per-table and per-function usage statistics of all backends on
> + * shared memory. pg_count_*() and friends are the interface to locally store
> + * backend activities during a transaction. Then pgstat_flush_stat() is called
> + * at the end of a transaction to pulish the local stats on shared memory.
> *

I'd rather not exhaustively list the different objects this handles -
it'll either be annoying to maintain, or just get out of date.

> - * - Add some automatic call for pgstat vacuuming.
> + * To avoid congestion on the shared memory, we update shared stats no more
> + * often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> + * all the local numbers cannot be flushed immediately, we postpone updates
> + * and try the next chance after the interval of
> + * PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> + * PGSTAT_STAT_MAX_INTERVAL(1000ms).

I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.

> +/*
> + * Operation mode and return code of pgstat_get_db_entry.
> + */
> +#define PGSTAT_SHARED 0

This is unreferenced.

> +#define PGSTAT_EXCLUSIVE 1
> +#define PGSTAT_NOWAIT 2

And these should imo rather be parameters.
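
E.g. a sketch (the lock_failed out-parameter is just an idea, given my
complaint about the lookup-result enum below):

static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid,
											   bool exclusive,
											   bool nowait,
											   bool *lock_failed);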

> +typedef enum PgStat_TableLookupResult
> +{
> + NOT_FOUND,
> + FOUND,
> + LOCK_FAILED
> +} PgStat_TableLookupResult;

This seems like a seriously bad idea to me. These are very generic
names. There are also basically no references, except for setting the
first two?

> +#define StatsLock (&StatsShmem->StatsMainLock)
>
> -static time_t last_pgstat_start_time;
> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct
> +{
> + LWLock StatsMainLock; /* lock to protect this struct */
> + dsa_handle stats_dsa_handle; /* DSA handle for stats data */
> + dshash_table_handle db_hash_handle;
> + dsa_pointer global_stats;
> + dsa_pointer archiver_stats;
> + int refcount;
> +} StatsShmemStruct;

Why isn't this an LWLock in lwlocknames.h, rather than allocated
here?

> +/*
> + * BgWriter global statistics counters. The name cntains a remnant from the
> + * time when the stats collector was a dedicate process, which used sockets to
> + * send it.
> + */
> +PgStat_MsgBgWriter BgWriterStats = {0};

I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.

> +/* common header of snapshot entry in reader snapshot hash */
> +typedef struct PgStat_snapshot
> +{
> + Oid key;
> + bool negative;
> + void *body; /* end of header part: to keep alignment */
> +} PgStat_snapshot;

> +/* context struct for snapshot_statentry */
> +typedef struct pgstat_snapshot_param
> +{
> + char *hash_name; /* name of the snapshot hash */
> + int hash_entsize; /* element size of hash entry */
> + dshash_table_handle dsh_handle; /* dsh handle to attach */
> + const dshash_parameters *dsh_params;/* dshash params */
> + HTAB **hash; /* points to variable to hold hash */
> + dshash_table **dshash; /* ditto for dshash */
> +} pgstat_snapshot_param;

Why does this exist? The struct contents are actually constant across
calls, yet you have declared them inside functions (as static - static
on function scope isn't really the same as global static).

If we want it, I think we should separate the naming more
meaningfully. The important difference between 'hash' and 'dshash' isn't
the hashing module, it's that one is a local copy, the other a shared
hashtable!

> +/*
> + * Backends store various database-wide info that's waiting to be flushed out
> + * to shared memory in these variables.
> + *
> + * checksum_failures is the exception in that it is cluster-wide value.
> + */
> +typedef struct BackendDBStats
> +{
> + int n_conflict_tablespace;
> + int n_conflict_lock;
> + int n_conflict_snapshot;
> + int n_conflict_bufferpin;
> + int n_conflict_startup_deadlock;
> + int n_deadlocks;
> + size_t n_tmpfiles;
> + size_t tmpfilesize;
> + HTAB *checksum_failures;
> +} BackendDBStats;

Why is this a separate struct from PgStat_StatDBEntry? We shouldn't have
these fields in multiple places.

> + if (StatsShmem->refcount > 0)
> + StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.

To me it looks like there's a lot of added complexity just because you
want to be able to reset stats via

void
pgstat_reset_all(void)
{

/*
* We could directly remove files and recreate the shared memory area. But
* detach then attach for simplicity.
*/
pgstat_detach_shared_stats(false); /* Don't write */
pgstat_attach_shared_stats();

Without that you wouldn't need the complexity of attaching and detaching
to the same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.

Nor would the dynamic re-creation of the db dshash table be needed.

> +/* ----------
> + * pgstat_report_stat() -
> + *
> + * Must be called by processes that performs DML: tcop/postgres.c, logical
> + * receiver processes, SPI worker, etc. to apply the so far collected
> + * per-table and function usage statistics to the shared statistics hashes.
> + *
> + * Updates are applied not more frequent than the interval of
> + * PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
> + * failure if force is false and there's no pending updates longer than
> + * PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
> + * succeeding calls of this function.
> + *
> + * Returns the time until the next timing when updates are applied in
> + * milliseconds if there are no updates holded for more than
> + * PGSTAT_STAT_MIN_INTERVAL milliseconds.
> + *
> + * Note that this is called only out of a transaction, so it is fine to use
> + * transaction stop time as an approximation of current time.
> + * ----------
> + */

Inconsistent indentation.

> +long
> +pgstat_report_stat(bool force)
> {

> + /* Flush out table stats */
> + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> + pending_stats = true;
> +
> + /* Flush out function stats */
> + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> + pending_stats = true;

This seems weird. pgstat_flush_stat() and pgstat_flush_funcstats() operate
on pgStatTabList/pgStatFunctions, but don't actually reset them? Besides
being confusing while reading the code, it also made the diff much
harder to read.

> - snprintf(fname, sizeof(fname), "%s/%s", directory,
> - entry->d_name);
> - unlink(fname);
> + /* Flush out database-wide stats */
> + if (HAVE_PENDING_DBSTATS())
> + {
> + if (!pgstat_flush_dbstats(&cxt, !force))
> + pending_stats = true;
> }

Linearly checking a number of stats doesn't seem like the right way
going forward. It also seems fairly omission-prone.

Why does this check live in pgstat_report_stat(), rather than in
pgstat_flush_dbstats()?

> /*
> * snapshot_statentry() - Common routine for functions
> * pgstat_fetch_stat_*entry()
> *

Why has this function been added between the closely linked
pgstat_report_stat() and pgstat_flush_stat() etc?

> * Returns the pointer to a snapshot of a shared entry for the key or NULL if
> * not found. Returned snapshots are stable during the current transaction or
> * until pgstat_clear_snapshot() is called.
> *
> * The snapshots are stored in a hash, pointer to which is stored in the
> * *HTAB variable pointed by cxt->hash. If not created yet, it is created
> * using hash_name, hash_entsize in cxt.
> *
> * cxt->dshash points to dshash_table for dbstat entries. If not yet
> * attached, it is attached using cxt->dsh_handle.

Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without
control from the outside:

> /*
> * We don't want so frequent update of stats snapshot. Keep it at least
> * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> */
> if (clear_snapshot)
> {
> clear_snapshot = false;
>
> if (pgStatSnapshotContext &&
> snapshot_globalStats.stats_timestamp <
> GetCurrentStatementStartTimestamp() -
> PGSTAT_STAT_MIN_INTERVAL * 1000)
> {
> MemoryContextReset(pgStatSnapshotContext);
>
> /* Reset variables */
> global_snapshot_is_valid = false;
> pgStatSnapshotContext = NULL;
> pgStatLocalHash = NULL;
>
> pgstat_setup_memcxt();
> }
> }

I think we should just remove this entire local caching snapshot layer
for lookups.

> /*
> * pgstat_flush_stat: Flushes table stats out to shared statistics.
> *

Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
or such? Given that the code for dealing with an individual table's
entry is named pgstat_flush_tabstat(), that's very confusing.

> * If nowait is true, returns false if required lock was not acquired
> * immediately. In that case, unapplied table stats updates are left alone in
> * TabStatusArray to wait for the next chance. cxt holds some dshash related
> * values that we want to carry around while updating shared stats.
> *
> * Returns true if all stats info are flushed. Caller must detach dshashes
> * stored in cxt after use.
> */
> static bool
> pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
> {
> static const PgStat_TableCounts all_zeroes;
> TabStatusArray *tsa;
> HTAB *new_tsa_hash = NULL;
> TabStatusArray *dest_tsa = pgStatTabList;
> int dest_elem = 0;
> int i;
>
> /* nothing to do, just return */
> if (pgStatTabHash == NULL)
> return true;
>
> /*
> * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
> * entries it points to.
> */
> hash_destroy(pgStatTabHash);
> pgStatTabHash = NULL;
>
> /*
> * Scan through the TabStatusArray struct(s) to find tables that actually
> * have counts, and try flushing it out to shared stats. We may fail on
> * some entries in the array. Leaving the entries being packed at the
> * beginning of the array.
> */
> for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> {

It seems odd that there's tabstat-specific code in pgstat_flush_stat()
(also note the singular name, even though it processes all tables,
whereas below pgstat_flush_tabstat() affects only a single table).

> for (i = 0; i < tsa->tsa_used; i++)
> {
> PgStat_TableStatus *entry = &tsa->tsa_entries[i];
>
> /* Shouldn't have any pending transaction-dependent counts */
> Assert(entry->trans == NULL);
>
> /*
> * Ignore entries that didn't accumulate any actual counts, such
> * as indexes that were opened by the planner but not used.
> */
> if (memcmp(&entry->t_counts, &all_zeroes,
> sizeof(PgStat_TableCounts)) == 0)
> continue;
>
> /* try to apply the tab stats */
> if (!pgstat_flush_tabstat(cxt, nowait, entry))
> {
> /*
> * Failed. Move it to the beginning in TabStatusArray and
> * leave it.
> */
> TabStatHashEntry *hash_entry;
> bool found;
>
> if (new_tsa_hash == NULL)
> new_tsa_hash = create_tabstat_hash();
>
> /* Create hash entry for this entry */
> hash_entry = hash_search(new_tsa_hash, &entry->t_id,
> HASH_ENTER, &found);
> Assert(!found);
>
> /*
> * Move insertion pointer to the next segment if the segment
> * is filled up.
> */
> if (dest_elem >= TABSTAT_QUANTUM)
> {
> Assert(dest_tsa->tsa_next != NULL);
> dest_tsa = dest_tsa->tsa_next;
> dest_elem = 0;
> }
>
> /*
> * Pack the entry at the begining of the array. Do nothing if
> * no need to be moved.
> */
> if (tsa != dest_tsa || i != dest_elem)
> {
> PgStat_TableStatus *new_entry;
> new_entry = &dest_tsa->tsa_entries[dest_elem];
> *new_entry = *entry;
>
> /* use new_entry as entry hereafter */
> entry = new_entry;
> }
>
> hash_entry->tsa_entry = entry;
> dest_elem++;
> }

This seems like too much code. Why is this entirely different from the
way funcstats works? The difference was already too big before, but this
made it *way* worse.

One goal of this project, as I understand it, is to make it easier to
add additional stats. As is, this seems to make it harder from the code
level.

> bool
> pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> PgStat_TableStatus *entry)
> {
> Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> int table_mode = PGSTAT_EXCLUSIVE;
> bool updated = false;
> dshash_table *tabhash;
> PgStat_StatDBEntry *dbent;
> int generation;
>
> if (nowait)
> table_mode |= PGSTAT_NOWAIT;
>
> /* Attach required table hash if not yet. */
> if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> {
> /*
> * Return if we don't have corresponding dbentry. It would've been
> * removed.
> */
> dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> if (!dbent)
> return false;
>
> /*
> * We don't hold lock on the dbentry since it cannot be dropped while
> * we are working on it.
> */
> generation = pin_hashes(dbent);
> tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.

> if (entry->t_shared)
> {
> cxt->shgeneration = generation;
> cxt->shdbentry = dbent;
> cxt->shdb_tabhash = tabhash;
> }
> else
> {
> cxt->mygeneration = generation;
> cxt->mydbentry = dbent;
> cxt->mydb_tabhash = tabhash;
>
> /*
> * We come here once per database. Take the chance to update
> * database-wide stats
> */
> LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
> dbent->n_xact_commit += pgStatXactCommit;
> dbent->n_xact_rollback += pgStatXactRollback;
> dbent->n_block_read_time += pgStatBlockReadTime;
> dbent->n_block_write_time += pgStatBlockWriteTime;
> LWLockRelease(&dbent->lock);
> pgStatXactCommit = 0;
> pgStatXactRollback = 0;
> pgStatBlockReadTime = 0;
> pgStatBlockWriteTime = 0;
> }
> }
> else if (entry->t_shared)
> {
> dbent = cxt->shdbentry;
> tabhash = cxt->shdb_tabhash;
> }
> else
> {
> dbent = cxt->mydbentry;
> tabhash = cxt->mydb_tabhash;
> }
>
>
> /*
> * Local table stats should be applied to both dbentry and tabentry at
> * once. Update dbentry only if we could update tabentry.
> */
> if (pgstat_update_tabentry(tabhash, entry, nowait))
> {
> pgstat_update_dbentry(dbent, entry);
> updated = true;
> }

At this point we're very deeply nested. pgstat_report_stat() ->
pgstat_flush_stat() -> pgstat_flush_tabstat() ->
pgstat_update_tabentry().

That's way over the top imo.

I don't think it makes much sense that pgstat_update_dbentry() is called
separately for each table. Why would we want to constantly lock that
entry? It seems much more sensible to instead have
pgstat_flush_stat() transfer the stats it reported into pending
database-wide counters, and then report those to shared memory *once* per
pgstat_report_stat(), with pgstat_flush_dbstats()?
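
A sketch of what I mean, with invented names (get_pending_dbstats()
doesn't exist) - the per-table path would only touch process-local
memory:

	/* in pgstat_flush_tabstat(), after a successful pgstat_update_tabentry(): */
	PgStat_StatDBEntry *pending = get_pending_dbstats(entry->t_shared);

	pending->n_tuples_returned += entry->t_counts.t_tuples_returned;
	pending->n_tuples_fetched += entry->t_counts.t_tuples_fetched;
	pending->n_blocks_fetched += entry->t_counts.t_blocks_fetched;
	pending->n_blocks_hit += entry->t_counts.t_blocks_hit;

	/* ...and pgstat_flush_dbstats() then applies the accumulated values
	 * to shared memory under a single lock acquisition. */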

> /*
> * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
> *
> * If nowait is true, returns with false on lock failure on dbentry.
> *
> * Returns true if all stats are flushed out.
> */
> static bool
> pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
> {
> /* get dbentry if not yet */
> if (cxt->mydbentry == NULL)
> {
> int op = PGSTAT_EXCLUSIVE;
> if (nowait)
> op |= PGSTAT_NOWAIT;
>
> cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
>
> /* return if lock failed. */
> if (cxt->mydbentry == NULL)
> return false;
>
> /* we use this generation of table /function stats in this turn */
> cxt->mygeneration = pin_hashes(cxt->mydbentry);
> }
>
> LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
> if (HAVE_PENDING_CONFLICTS())
> pgstat_flush_recovery_conflict(cxt->mydbentry);
> if (BeDBStats.n_deadlocks != 0)
> pgstat_flush_deadlock(cxt->mydbentry);
> if (BeDBStats.n_tmpfiles != 0)
> pgstat_flush_tempfile(cxt->mydbentry);
> if (BeDBStats.checksum_failures != NULL)
> pgstat_flush_checksum_failure(cxt->mydbentry);
> LWLockRelease(&cxt->mydbentry->lock);

What's the point of having all these sub-functions? I see that you, for
an undocumented reason, have pgstat_report_recovery_conflict() flush
conflict stats immediately:

> dbentry = pgstat_get_db_entry(MyDatabaseId,
> PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> &status);
>
> if (status == LOCK_FAILED)
> return;
>
> /* We had a chance to flush immediately */
> pgstat_flush_recovery_conflict(dbentry);
>
> dshash_release_lock(pgStatDBHash, dbentry);

But I don't understand why. Nor why we wouldn't just report all pending
database-wide changes in that case?

The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.
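
E.g. for the database-wide update in pgstat_flush_tabstat() quoted
above, something like this sketch:

	if (LWLockConditionalAcquire(&dbent->lock, LW_EXCLUSIVE))
	{
		dbent->n_xact_commit += pgStatXactCommit;
		dbent->n_xact_rollback += pgStatXactRollback;
		dbent->n_block_read_time += pgStatBlockReadTime;
		dbent->n_block_write_time += pgStatBlockWriteTime;
		LWLockRelease(&dbent->lock);

		pgStatXactCommit = 0;
		pgStatXactRollback = 0;
		pgStatBlockReadTime = 0;
		pgStatBlockWriteTime = 0;
	}
	/* else: leave the counters pending and retry on the next flush */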

> /* ----------
> * pgstat_vacuum_stat() -
> *
> * Remove objects we can get rid of.
> * ----------
> */
> void
> pgstat_vacuum_stat(void)
> {
> HTAB *oidtab;
> dshash_seq_status dshstat;
> PgStat_StatDBEntry *dbentry;
>
> /* we don't collect stats under standalone mode */
> if (!IsUnderPostmaster)
> return;
>
> /*
> * Read pg_database and make a list of OIDs of all existing databases
> */
> oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
>
> /*
> * Search the database hash table for dead databases and drop them
> * from the hash.
> */
>
> dshash_seq_init(&dshstat, pgStatDBHash, false, true);
> while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
> {
> Oid dbid = dbentry->databaseid;
>
> CHECK_FOR_INTERRUPTS();
>
> /* the DB entry for shared tables (with InvalidOid) is never dropped */
> if (OidIsValid(dbid) &&
> hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
> pgstat_drop_database(dbid);
> }
>
> /* Clean up */
> hash_destroy(oidtab);

So, uh, pgstat_drop_database() again does a *separate* lookup in the
dshash, locking the entry. Which only works because you added this dirty
hack:

/* We need to keep partition lock while sequential scan */
if (!hash_table->seqscan_running)
{
hash_table->find_locked = false;
hash_table->find_exclusively_locked = false;
LWLockRelease(PARTITION_LOCK(hash_table, partition));
}

to dshash_delete_entry(). This seems insane to me. There's not even a
comment explaining this?

> /*
> * Similarly to above, make a list of all known relations in this DB.
> */
> oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
>
> /*
> * Check for all tables listed in stats hashtable if they still exist.
> * Stats cache is useless here so directly search the shared hash.
> */
> pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
>
> /*
> * Repeat the above but we needn't bother in the common case where no
> * function stats are being collected.
> */
> if (dbentry->functions != DSM_HANDLE_INVALID)
> {
> oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
>
> pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
> oidtab);
> }
> dshash_release_lock(pgStatDBHash, dbentry);

Wait, why are we holding the database partition lock across all this?
Again without any comments explaining why?

> +void
> +pgstat_send_archiver(const char *xlog, bool failed)

Why do we still have functions named pgstat_send*?

Greetings,

Andres Freund
