Re: shared-memory based stats collector

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: andres(at)anarazel(dot)de
Cc: michael(at)paquier(dot)xyz, alvherre(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, a(dot)zakirov(at)postgrespro(dot)ru, ah(at)cybertec(dot)at, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2020-03-19 11:30:04
Message-ID: 20200319.203004.262929931224076029.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Thank you for the comment.

The new version is attached.

At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in
> General comments about this series:
> - A lot of the review comments feel like I've written them before, a
> year or more ago. I feel this patch ought to be in a much better
> state. There's a lot of IMO fairly obvious stuff here, and things that
> have been mentioned multiple times previously.
> - There's a *lot* of typos in here. I realize being an ESL is hard, but
> a lot of these can be found with the simplest spellchecker. That's
> one thing for a patch that just has been hacked up as a POC, but this
> is a multi year thread?
> - There's some odd formatting. Consider using pgindent more regularly.
>
> More detailed comments below.
>
> I'm considering rewriting the parts of the patchset that I don't like -
> but it'll look quite different afterwards.
>
> On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> > From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Date: Fri, 29 Jun 2018 16:41:04 +0900
> > Subject: [PATCH 1/5] sequential scan for dshash
> >
> > Add sequential scan feature to dshash.
>
>
> > dsa_pointer item_pointer = hash_table->buckets[i];
> > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> > LW_EXCLUSIVE));
> >
> > delete_item(hash_table, item);
> > - hash_table->find_locked = false;
> > - hash_table->find_exclusively_locked = false;
> > - LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +
> > + /* We need to keep partition lock while sequential scan */
> > + if (!hash_table->seqscan_running)
> > + {
> > + hash_table->find_locked = false;
> > + hash_table->find_exclusively_locked = false;
> > + LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > + }
> > }
>
> This seems like a failure prone API.

[001] (Fixed)
As a result of the fix in [044], it's gone now.

> > /*
> > @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entr
> > + * dshash_seq_init/_next/_term
> > + * Sequentially scan trhough dshash table and return all the
> > + * elements one by one, return NULL when no more.
>
> s/trhough/through/

[002] (Fixed)

> This uses a different comment style that the other functions in this
> file. Why?

[003] (Fixed)

It was following the equivalent in dynahash.c. I rewrote it in a
different way.

> > + * dshash_seq_term should be called for incomplete scans and otherwise
> > + * shoudln't. Finished scans are cleaned up automatically.
>
> s/shoudln't/shouldn't/

[004] (Fixed)

> I find the "cleaned up automatically" API terrible. I know you copied it
> from dynahash, but I find it to be really failure prone. dynahash isn't
> an example of good postgres code, the opposite, I'd say. It's a lot
> easier to unconditionally have a terminate call if we need that.

[005] (Fixed)
OK, I remember I had a similar thought on this. Fixed in all
corresponding places.

> > + * Returned elements are locked as is the case with dshash_find. However, the
> > + * caller must not release the lock.
> > + *
> > + * Same as dynanash, the caller may delete returned elements midst of a scan.
>
> I think it's a bad idea to refer to dynahash here. That's just going to
> get out of date. Also, code should be documented on its own.

[006] (Fixed)
Understood. Fixed as follows.

* Returned elements are locked and the caller must not explicitly release
* them.

> > + * If consistent is set for dshash_seq_init, the all hash table partitions are
> > + * locked in the requested mode (as determined by the exclusive flag) during
> > + * the scan. Otherwise partitions are locked in one-at-a-time way during the
> > + * scan.
>
> Yet delete unconditionally retains locks?

[007] (Not fixed)
Yes. If we release the lock on the current partition, a hash resize
would break concurrent scans.

> > + */
> > +void
> > +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> > + bool consistent, bool exclusive)
> > +{
>
> Why does this patch add the consistent mode? There's no users currently?
> Without it's not clear that we need a seperate _term function, I think?

[008] (Fixed)
I remember it was used in an early stage of development. I left it in
for API completeness, but it is not actually used. _term is another
matter: we need to release the lock and clean up some dshash state if
we allow a seq scan to exit before it reaches the end.

I removed "consistent" from dshash_seq_init and reverted
dshash_seq_term.

> I think we also can get rid of the dshash_delete changes, by instead
> adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> or such.

[009] (Fixed)
I'm not sure about the point of having two interfaces that are hard to
distinguish. Maybe dshash_delete_current(dshash_seq_stat *status) is
enough. I also reverted dshash_delete().
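For readers following along, here is a minimal, self-contained sketch of the cursor bookkeeping a dshash_delete_current-style API needs (a plain chained hash with invented names; no DSA, partition locks, or anything from the actual dshash code): the scan keeps a pointer to the predecessor link, so the current element can be unlinked without losing the scan position.

```c
#include <stdbool.h>
#include <stdlib.h>

#define NBUCKETS 4

/* toy stand-ins; real dshash works on dsa_pointers under partition LWLocks */
typedef struct item { int key; struct item *next; } item;
typedef struct { item *buckets[NBUCKETS]; } table;
typedef struct {
    table *t;
    int bucket;
    item *cur;      /* element returned by the last seq_next() */
    item **prevp;   /* link pointing at cur, used for unlinking */
    bool deleted;   /* cur was deleted; resume from *prevp */
} seq_status;

static void insert_key(table *t, int key)
{
    item *it = malloc(sizeof(item));
    it->key = key;
    it->next = t->buckets[key % NBUCKETS];
    t->buckets[key % NBUCKETS] = it;
}

static void seq_init(seq_status *st, table *t)
{
    st->t = t; st->bucket = -1; st->cur = NULL;
    st->prevp = NULL; st->deleted = false;
}

static item *seq_next(seq_status *st)
{
    if (st->deleted)
    {
        st->deleted = false;
        st->cur = *st->prevp;       /* successor of the deleted element */
    }
    else if (st->cur)
    {
        st->prevp = &st->cur->next;
        st->cur = st->cur->next;
    }
    while (st->cur == NULL && ++st->bucket < NBUCKETS)
    {
        st->prevp = &st->t->buckets[st->bucket];
        st->cur = *st->prevp;
    }
    return st->cur;
}

/* the dshash_delete_current() analogue: unlink the scan's current element */
static void seq_delete_current(seq_status *st)
{
    item *dead = st->cur;
    *st->prevp = dead->next;
    free(dead);
    st->cur = NULL;
    st->deleted = true;
}

/* delete even keys mid-scan, then return the sum of surviving keys */
static int demo(void)
{
    table t = {{0}};
    seq_status st;
    item *it;
    int sum = 0;

    for (int k = 1; k <= 6; k++)
        insert_key(&t, k);
    seq_init(&st, &t);
    while ((it = seq_next(&st)) != NULL)
        if (it->key % 2 == 0)
            seq_delete_current(&st);
    seq_init(&st, &t);
    while ((it = seq_next(&st)) != NULL)
        sum += it->key;
    return sum;
}
```

The point of the single-function form is that the caller never touches a lock or a bucket link directly; the status struct owns all resumption state.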

> > @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
> > extern void dshash_detach(dshash_table *hash_table);
> > extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
> > extern void dshash_destroy(dshash_table *hash_table);
> > -
> > /* Finding, creating, deleting entries. */
> > extern void *dshash_find(dshash_table *hash_table,
> > const void *key, bool
> > exclusive);
>
> There's a number of spurious changes like this.

[010] (Fixed)
I found such isolated line insertions and removals: two in 0001, eight
in 0004.

> > From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Date: Thu, 27 Sep 2018 11:15:19 +0900
> > Subject: [PATCH 2/5] Add conditional lock feature to dshash
> >
> > Dshash currently waits for lock unconditionally. This commit adds new
> > interfaces for dshash_find and dshash_find_or_insert. The new
> > interfaces have an extra parameter "nowait" taht commands not to wait
> > for lock.
>
> s/taht/that/

[011] (Fixed)
Applied ispell on all commit messages.

> There should be at least a sentence or two explaining why these are
> useful.

[011] (Fixed)
Sounds reasonable. I rewrote it that way.

> > +/*
> > + * The version of dshash_find, which is allowed to return immediately on lock
> > + * failure. Lock status is set to *lock_failed in that case.
> > + */
>
> Hm. Not sure I like the *lock_acquired API.
>
> > +void *
> > +dshash_find_extended(dshash_table *hash_table, const void *key,
> > + bool exclusive, bool nowait, bool *lock_acquired)
...
> > + Assert(nowait || !lock_acquired);
...
> > + if (lock_acquired)
> > + *lock_acquired = false;
>
> Why is the test for lock_acquired needed here? I don't think it's
> possible to use nowait correctly without passing in lock_acquired?
>
> Think it'd make sense to document & assert that nowait = true implies
> lock_acquired set, and nowait = false implies lock_acquired not being
> set.
>
> But, uh, why do we even need the lock_acquired parameter? If we couldn't
> find an entry, then we should just release the lock, no?

[012] (Fixed) (related to [013], [014])
The name was confusing. In this version, the old dshash_find_extended
and dshash_find_or_insert_extended are merged into a new
dshash_find_extended, which covers all the functionality of
dshash_find and dshash_find_or_insert; in addition, insertion while
holding a shared lock is now allowed.
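To illustrate what such a merged entry point might look like, here is a sketch with invented names (a single pthread mutex stands in for the partition LWLock; this is not the patch's code): one function covers find, find-or-insert, and the nowait variants, and the nowait path simply uses a trylock and reports failure to the caller.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A one-slot "table" whose mutex stands in for a dshash partition lock. */
typedef struct
{
    pthread_mutex_t lock;
    int key;
    int value;
    bool occupied;
} mini_table;

/*
 * Merged lookup: returns the entry with the lock held, or NULL when the
 * key is absent (insert == false) or the lock couldn't be taken (nowait).
 */
static int *
find_extended(mini_table *t, int key, bool insert, bool nowait, bool *found)
{
    if (nowait)
    {
        if (pthread_mutex_trylock(&t->lock) != 0)
            return NULL;            /* would block; caller retries later */
    }
    else
        pthread_mutex_lock(&t->lock);

    if (t->occupied && t->key == key)
    {
        *found = true;
        return &t->value;           /* caller must release the lock */
    }
    if (insert)
    {
        t->key = key;
        t->value = 0;
        t->occupied = true;
        *found = false;
        return &t->value;           /* ditto */
    }
    pthread_mutex_unlock(&t->lock); /* nothing to return: release */
    *found = false;
    return NULL;
}

static void release(mini_table *t) { pthread_mutex_unlock(&t->lock); }

static int demo(void)
{
    mini_table t = { PTHREAD_MUTEX_INITIALIZER, 0, 0, false };
    bool found;
    int *v;

    v = find_extended(&t, 42, true, false, &found);   /* inserts */
    if (v == NULL || found) return -1;
    *v = 7;
    release(&t);

    /* lock already held: the nowait path must back off, not block */
    pthread_mutex_lock(&t.lock);
    if (find_extended(&t, 42, false, true, &found) != NULL) return -2;
    pthread_mutex_unlock(&t.lock);

    v = find_extended(&t, 42, false, false, &found);  /* plain find */
    if (v == NULL || !found) return -3;
    release(&t);
    return *v;
}
```

With the flags in one signature, the four-function surface (find, find_extended, find_or_insert, find_or_insert_extended) collapses to two.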

> I'm however inclined to think it's better to just have a separate
> function for the nowait case, rather than an extended version supporting
> both (with an internal helper doing most of the work).

[013] (Fixed) (related to [012], [014])
After some thought, nowait no longer adds much complexity. In the end
I did as described in [012].

> > +/*
> > + * The version of dshash_find_or_insert, which is allowed to return immediately
> > + * on lock failure.
> > + *
> > + * Notes above dshash_find_extended() regarding locking and error handling
> > + * equally apply here.
>
> They don't, there's no lock_acquired parameter.
>
> > + */
> > +void *
> > +dshash_find_or_insert_extended(dshash_table *hash_table,
> > + const void *key,
> > + bool *found,
> > + bool nowait)
>
> I think it's absurd to have dshash_find, dshash_find_extended,
> dshash_find_or_insert, dshash_find_or_insert_extended. If they're
> extended they should also be able to specify whether the entry will get
> created.

[014] (Fixed) (related to [012], [013])
As mentioned above, this version has the original two functions plus a
single dshash_find_extended().

> > From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Date: Wed, 7 Nov 2018 16:53:49 +0900
> > Subject: [PATCH 3/5] Make archiver process an auxiliary process
> >
> > This is a preliminary patch for shared-memory based stats collector.
> > Archiver process must be a auxiliary process since it uses shared
> > memory after stats data wes moved onto shared-memory. Make the process
>
> s/wes/was/ s/onto/into/

[015] (Fixed)

> > an auxiliary process in order to make it work.
>
> >
>
> > @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
> > StartupProcessMain();
> > proc_exit(1); /* should never return */
> >
> > + case ArchiverProcess:
> > + /* don't set signals, archiver has its own agenda */
> > + PgArchiverMain();
> > + proc_exit(1); /* should never return */
> > +
> > case BgWriterProcess:
> > /* don't set signals, bgwriter has its own agenda */
> > BackgroundWriterMain();
>
> I think I'd rather remove the two comments that are copied to 6 out of 8
> cases - they don't add anything.

[016] (Fixed)
Agreed. I removed the comments from StartProcess to WalReceiverProcess.

> > pgarch_exit(SIGNAL_ARGS)
> > {
..
> > + * We DO NOT want to run proc_exit() callbacks -- we're here because
> > + * shared memory may be corrupted, so we don't want to try to clean up our
...
> > + * being doubly sure.)
> > + */
> > + exit(2);
...
> This seems to be a copy of code & comments from other signal handlers that predates
..
> I think this just should use SignalHandlerForCrashExit().
> I think we can even commit that separately - there's not really a reason
> to not do that today, as far as I can tell?

[017] (Fixed, separate patch 0001)
Exactly. Although the on_*_exit_list is empty in that process, SIGQUIT
ought to prevent the process from calling the callbacks even if there
were any. This changes the archiver's exit status on SIGQUIT from 1 to
2, but that causes no behavior change (other than the log message).

> > /* SIGUSR1 signal handler for archiver process */
>
> Hm - this currently doesn't set up a correct sigusr1 handler for a
> shared memory backend - needs to invoke procsignal_sigusr1_handler
> somewhere.
>
> We can probably just convert to using normal latches here, and remove
> the current 'wakened' logic? That'll remove the indirection via
> postmaster too, which is nice.

[018] (Fixed, separate patch 0005)
It seems better. I added it as a separate patch just after the patch
that turns archiver an auxiliary process.

> > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> >
> > switch (backendType)
> > {
> > + case B_ARCHIVER:
> > + backendDesc = "archiver";
> > + break;
>
> should imo include 'WAL' or such.

[019] (Not Fixed)
It was already named "archiver" by 8e8a0becb3. Should I rename it in
this patch set?

> > From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Date: Thu, 21 Feb 2019 12:44:56 +0900
> > Subject: [PATCH 4/5] Shared-memory based stats collector
..
> > megabytes. To deal with larger statistics set, this patch let backends
> > directly share the statistics via shared memory.
>
> This spends a fair bit describing the old state, but very little
> describing the new state.

[020] (Fixed, Maybe)
Ugh. I got the same comment in the last round. I rewrote it this time.

> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 0bfd6151c4..a6b0bdec12 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
...
> > - master process. (The <quote>stats collector</quote> process will not be present
> > - if you have set the system not to start the statistics collector; likewise
> > + master process. (The <quote>autovacuum launcher</quote> process will not
...
> There's more references to the stats collector than this... E.g. in
> catalogs.sgml

[021] (Fixed, separate patch 0007)
Although the "statistics collector process" is gone, I'm not sure the
"statistics collector" feature is gone as well. Still, the word
"collector" looks a bit odd in some contexts. I replaced "the results
of the statistics collector" with "the activity statistics". (I'm not
sure "the activity statistics" is proper as a subsystem name.) The
word "collect" is replaced with "track". I didn't change the section
IDs corresponding to the renaming, so old links keep working. I also
fixed the tranche name for LWTRANCHE_STATS from "activity stats" to
"activity_statistics".

> > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> > index 6d1f28c327..8dcb0fb7f7 100644
> > --- a/src/backend/postmaster/autovacuum.c
> > +++ b/src/backend/postmaster/autovacuum.c
...
> > @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
> > if (isshared)
> > {
> > if (PointerIsValid(shared))
> > - tabentry = hash_search(shared->tables, &relid,
> > - HASH_FIND, NULL);
> > + tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
> > }
> > else if (PointerIsValid(dbentry))
> > - tabentry = hash_search(dbentry->tables, &relid,
> > - HASH_FIND, NULL);
> > + tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
> >
> > return tabentry;
> > }
>
> Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside

[022] (Fixed)
The _extended function is not an extended version of the original
function. I renamed pgstat_fetch_stat_tabentry_extended to
pgstat_fetch_stat_tabentry_snapshot. pgstat_fetch_funcentry_extended
and pgstat_fetch_dbentry() are renamed accordingly.

> the stats subsystem there are exactly one caller for the non extended
> version, as far as I can see. That's index_concurrently_swap() - and imo
> that's code that should live in the stats subsystem, rather than open
> coded in index.c.

[023] (Fixed)
Agreed. I added a new function pgstat_copy_index_counters(), and
pgstat_fetch_stat_tabentry() now has no call sites outside the pgstat
subsystem.

> > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > index ca5c6376e5..1ffe073a1f 100644
> > --- a/src/backend/postmaster/pgstat.c
> > +++ b/src/backend/postmaster/pgstat.c
> > + * Collects per-table and per-function usage statistics of all backends on
> > + * shared memory. pg_count_*() and friends are the interface to locally store
> > + * backend activities during a transaction. Then pgstat_flush_stat() is called
> > + * at the end of a transaction to pulish the local stats on shared memory.
> > *
>
> I'd rather not exhaustively list the different objects this handles -
> it'll either be annoying to maintain, or just get out of date.

[024] (Fixed, Maybe)
I'm not sure I understood you correctly, but I rewrote it as follows.

* Collects per-table and per-function usage statistics of all backends on
* shared memory. The activity numbers are first stored locally, then written
* to shared memory at commit time or by idle timeout.

> > - * - Add some automatic call for pgstat vacuuming.
> > + * To avoid congestion on the shared memory, we update shared stats no more
> > + * often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> > + * all the local numbers cannot be flushed immediately, we postpone updates
> > + * and try the next chance after the interval of
> > + * PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> > + * PGSTAT_STAT_MAX_INTERVAL(1000ms).
>
> I'm not convinced by this backoff logic. The basic interval seems quite
> high for something going through shared memory, and the max retry seems
> pretty low.

[025] (Not Fixed)
Is it a matter of the intervals? Would (MIN, RETRY, MAX) = (1000, 500,
10000) be reasonable?
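For what it's worth, the scheduling rule the quoted comment describes can be modeled in a few lines (a sketch with invented names, using the patch's current 500/100/1000ms values; not the patch's code):

```c
#include <stdbool.h>

#define PGSTAT_MIN_INTERVAL    500   /* ms between ordinary flushes */
#define PGSTAT_RETRY_INTERVAL  100   /* ms before retrying a partial flush */
#define PGSTAT_MAX_INTERVAL   1000   /* ms after which we wait for locks */

typedef struct
{
    long last_flush;     /* last complete flush */
    long last_attempt;   /* last (possibly partial) attempt */
    long first_pending;  /* when the oldest unflushed counters appeared */
    bool have_pending;   /* did the last attempt leave stats behind? */
} flush_clock;

/* Is a flush due at time "now", and may it skip locked entries (*nowait)? */
static bool
flush_due(const flush_clock *c, long now, bool *nowait)
{
    if (c->have_pending)
    {
        if (now - c->last_attempt < PGSTAT_RETRY_INTERVAL)
            return false;
        /* past MAX_INTERVAL we stop skipping locked entries */
        *nowait = (now - c->first_pending < PGSTAT_MAX_INTERVAL);
        return true;
    }
    if (now - c->last_flush < PGSTAT_MIN_INTERVAL)
        return false;
    *nowait = true;
    return true;
}

/* walk the rule through one pending episode; 0 means every step matched */
static int demo(void)
{
    flush_clock c = {0, 0, 0, false};
    bool nowait = false;

    if (flush_due(&c, 300, &nowait))  return 1;  /* under MIN_INTERVAL */
    if (!flush_due(&c, 600, &nowait) || !nowait) return 2;
    c.have_pending = true; c.last_attempt = 600; c.first_pending = 600;
    if (flush_due(&c, 650, &nowait))  return 3;  /* under RETRY_INTERVAL */
    if (!flush_due(&c, 750, &nowait) || !nowait) return 4;
    c.last_attempt = 750;
    if (!flush_due(&c, 1700, &nowait) || nowait) return 5;  /* past MAX */
    return 0;
}
```

Tuning then only means changing the three constants; the tradeoff is shared-memory lock traffic versus how stale the published counters may get.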

> > +/*
> > + * Operation mode and return code of pgstat_get_db_entry.
> > + */
> > +#define PGSTAT_SHARED 0
>
> This is unreferenced.
>
>
> > +#define PGSTAT_EXCLUSIVE 1
> > +#define PGSTAT_NOWAIT 2
>
> And these should imo rather be parameters.

[026] (Fixed)
Mmm. Right. The two symbols convey just two distinct parameters; two
booleans suffice. But I found some confusion here. As a result,
pgstat_get_db_entry now has three boolean parameters: exclusive,
nowait and create.

> > +typedef enum PgStat_TableLookupResult
> > +{
> > + NOT_FOUND,
> > + FOUND,
> > + LOCK_FAILED
> > +} PgStat_TableLookupResult;
>
> This seems like a seriously bad idea to me. These are very generic
> names. There's also basically no references except setting them to the
> first two?

[027] (Fixed)
Considering the related comments above, I decided not to return the
lock status from pgstat_get_db_entry. That makes the enum useless and
the function simpler.

> > +#define StatsLock (&StatsShmem->StatsMainLock)
> >
> > -static time_t last_pgstat_start_time;
> > +/* Shared stats bootstrap information */
> > +typedef struct StatsShmemStruct
> > +{
> > + LWLock StatsMainLock; /* lock to protect this struct */
...
> > +} StatsShmemStruct;
>
> Why isn't this an lwlock in lwlock in lwlocknames.h, rather than
> allocated here?

[028] (Fixed)
The activity stats system already used a dedicated tranche, so I
thought it natural for this lock to be in the same tranche. That is
not actually a firm reason, so I moved the lock into the main tranche.

> > +/*
> > + * BgWriter global statistics counters. The name cntains a remnant from the
> > + * time when the stats collector was a dedicate process, which used sockets to
> > + * send it.
> > + */
> > +PgStat_MsgBgWriter BgWriterStats = {0};
>
> I am strongly against keeping the 'Msg' prefix. That seems extremely
> confusing going forward.

[029] (Fixed) (Related to [046])
Mmm. That was following your old suggestion, to avoid insubstantial
diffs. I'm happy to change it. The functions that have "send" in their
names exist for the same reason. I removed the prefix "m_" from the
members of the struct. (The comment above, with a typo, explains the
history.)

> > +/* common header of snapshot entry in reader snapshot hash */
> > +typedef struct PgStat_snapshot
> > +{
> > + Oid key;
> > + bool negative;
> > + void *body; /* end of header part: to keep alignment */
> > +} PgStat_snapshot;
>
>
> > +/* context struct for snapshot_statentry */
> > +typedef struct pgstat_snapshot_param
> > +{
> > + char *hash_name; /* name of the snapshot hash */
> > + int hash_entsize; /* element size of hash entry */
> > + dshash_table_handle dsh_handle; /* dsh handle to attach */
> > + const dshash_parameters *dsh_params;/* dshash params */
> > + HTAB **hash; /* points to variable to hold hash */
> > + dshash_table **dshash; /* ditto for dshash */
> > +} pgstat_snapshot_param;
>
> Why does this exist? The struct contents are actually constant across
> calls, yet you have declared them inside functions (as static - static
> on function scope isn't really the same as global static).

[030] (Fixed)
IIUC, I didn't want it initialized at every call, and it doesn't need
external linkage, so it was a static variable at function scope.
But first, the name _param is bogus, since the struct actually
contains context variables. Second, those "context" variables have
since moved elsewhere. I removed the struct and turned the members
into parameters of snapshot_statentry.

> If we want it, I think we should separate the naming more
> meaningfully. The important difference between 'hash' and 'dshash' isn't
> the hashing module, it's that one is a local copy, the other a shared
> hashtable!

[031] (Fixed)
Definitely. The parameters of snapshot_statentry now have more
meaningful names.

> > +/*
> > + * Backends store various database-wide info that's waiting to be flushed out
> > + * to shared memory in these variables.
> > + *
> > + * checksum_failures is the exception in that it is cluster-wide value.
> > + */
> > +typedef struct BackendDBStats
> > +{
> > + int n_conflict_tablespace;
> > + int n_conflict_lock;
> > + int n_conflict_snapshot;
> > + int n_conflict_bufferpin;
> > + int n_conflict_startup_deadlock;
> > + int n_deadlocks;
> > + size_t n_tmpfiles;
> > + size_t tmpfilesize;
> > + HTAB *checksum_failures;
> > +} BackendDBStats;
>
> Why is this a separate struct from PgStat_StatDBEntry? We should have
> these fields in multiple places.

[032] (Fixed, Maybe) (Related to [042])
It is almost a subset of PgStat_StatDBEntry, with one exception:
checksum_failures differs between the two. Anyway, tracking of
conflict events doesn't need to be fast, so they are now counted
directly on the shared hash entries.

Checksum failures are handled in a different way, so only that one is
left alone.

> > + if (StatsShmem->refcount > 0)
> > + StatsShmem->refcount++;
>
> What prevents us from leaking the refcount here? We could e.g. error out
> while attaching, no? Which'd mean we'd leak the refcount.

[033] (Fixed)
We don't attach the shared stats in the postmaster process, so I want
to know the first attacher process and the last detacher process of
the shared stats. It's not leaks that I'm considering here.
(continued below)

> To me it looks like there's a lot of added complexity just because you
> want to be able to reset stats via
>
> void
> pgstat_reset_all(void)
> {
>
> /*
> * We could directly remove files and recreate the shared memory area. But
> * detach then attach for simplicity.
> */
> pgstat_detach_shared_stats(false); /* Don't write */
> pgstat_attach_shared_stats();
>
> Without that you'd not need the complexity of attaching, detaching to
> the same degree - every backend could just cache lookup data during
> initialization, instead of having to constantly re-compute that.

Hmm. I don't get that (or I failed to grasp the meaning). The function
is assumed to be called only from StartupXLOG().
(continued)

> Nor would the dynamic re-creation of the db dshash table be needed.

Maybe you are referring to the complexity of reset_dbentry_counters?
It is indeed complex. The shared stats dshash cannot be destroyed (nor
can a dshash entry be removed) while someone is working on it. It was
simpler to wait for the other process to finish its work, but that
could slow not only the clearing process but also other processes
through frequent resetting of counters.

After some thought, I decided to rip all the "generation" stuff out,
and it gets far simpler. But counter resets may conflict with other
backends to a somewhat higher degree, because a counter reset needs an
exclusive lock.

> > +/* ----------
> > + * pgstat_report_stat() -
> > + *
> > + * Must be called by processes that performs DML: tcop/postgres.c, logical
> > + * receiver processes, SPI worker, etc. to apply the so far collected
> > + * per-table and function usage statistics to the shared statistics hashes.
> > + *
> > + * Updates are applied not more frequent than the interval of
> > + * PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
...
> Inconsistent indentation.

[034] (Fixed)

> > +long
> > +pgstat_report_stat(bool force)
> > {
>
> > + /* Flush out table stats */
> > + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > + pending_stats = true;
> > +
> > + /* Flush out function stats */
> > + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > + pending_stats = true;
>
> This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> being confusing while reading the code, it also made the diff much
> harder to read.

[035] (Maybe Fixed)
Is the question whether there is any case where
pgstat_flush_stat/functions leaves some counters unflushed? It skips
tables that someone else is working on (or tables in the same dshash
partition). Or is "!force == nowait" the cause of the confusion? It is
now written as "nowait = !force". (Or should I change the parameter of
pgstat_report_stat from "force" to "nowait"?)

> > - snprintf(fname, sizeof(fname), "%s/%s", directory,
> > - entry->d_name);
> > - unlink(fname);
> > + /* Flush out database-wide stats */
> > + if (HAVE_PENDING_DBSTATS())
> > + {
> > + if (!pgstat_flush_dbstats(&cxt, !force))
> > + pending_stats = true;
> > }
>
> Linearly checking a number of stats doesn't seem like the right way
> going forward. Also seems fairly omission prone.
>
> Why does this code check live in pgstat_report_stat(), rather than
> pgstat_flush_dbstats()?

[036] (Maybe Fixed) (Related to [041])
It was there to avoid useless calls, but it no longer exists; the code
disappeared with [041].

| /* Flush out individual stats tables */
| pending_stats |= pgstat_flush_stat(&cxt, nowait);
| pending_stats |= pgstat_flush_funcstats(&cxt, nowait);
| pending_stats |= pgstat_flush_checksum_failure(cxt.mydbentry, nowait);

> > /*
> > * snapshot_statentry() - Common routine for functions
> > * pgstat_fetch_stat_*entry()
> > *
>
> Why has this function been added between the closely linked
> pgstat_report_stat() and pgstat_flush_stat() etc?

[037]
It seems to have been left there after some editing. I moved it to
just before the caller functions.

> Why do we still have this? A hashtable lookup is cheap, compared to
> fetching a file - so it's not to save time. Given how infrequent the
> pgstat_fetch_* calls are, it's not to avoid contention either.
>
> At first one could think it's for consistency - but no, that's not it
> either, because snapshot_statentry() refetches the snapshot without
> control from the outside:

[038]
I don't get the second paragraph. When does the function re*create* a
snapshot without control from the outside? It keeps snapshots for the
duration of a transaction. If not, it is broken.
(continued)

> > /*
> > * We don't want so frequent update of stats snapshot. Keep it at least
> > * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> > */
...
> I think we should just remove this entire local caching snapshot layer
> for lookups.

Currently the behavior is documented as follows, and it seems reasonable.

Another important point is that when a server process is asked to display
any of these statistics, it first fetches the most recent report emitted by
the collector process and then continues to use this snapshot for all
statistical views and functions until the end of its current transaction.
So the statistics will show static information as long as you continue the
current transaction. Similarly, information about the current queries of
all sessions is collected when any such information is first requested
within a transaction, and the same information will be displayed throughout
the transaction.
This is a feature, not a bug, because it allows you to perform several
queries on the statistics and correlate the results without worrying that
the numbers are changing underneath you. But if you want to see new
results with each query, be sure to do the queries outside any transaction
block. Alternatively, you can invoke
<function>pg_stat_clear_snapshot</function>(), which will discard the
current transaction's statistics snapshot (if any). The next use of
statistical information will cause a new snapshot to be fetched.
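The documented rule reduces to a small amount of per-backend state; a toy model (invented names, a plain int standing in for the shared stats) might look like:

```c
#include <stdbool.h>

static int shared_counter = 0;          /* stands in for shared stats */
static int snap_value;
static bool snap_valid = false;

/* first lookup in a transaction takes the snapshot; later ones reuse it */
static int fetch_stat(void)
{
    if (!snap_valid)
    {
        snap_value = shared_counter;    /* take the snapshot */
        snap_valid = true;
    }
    return snap_value;
}

/* pg_stat_clear_snapshot() analogue: force a refetch on next use */
static void clear_snapshot(void) { snap_valid = false; }

/* snapshots never survive past transaction end */
static void end_transaction(void) { snap_valid = false; }

static int demo(void)
{
    int a, b, c;

    shared_counter = 10;
    a = fetch_stat();          /* snapshot taken: 10 */
    shared_counter = 20;       /* concurrent activity elsewhere */
    b = fetch_stat();          /* still 10: same transaction */
    clear_snapshot();
    c = fetch_stat();          /* refetched: 20 */
    end_transaction();
    return a * 100 + (b == a) * 10 + (c == 20);
}
```

The question in the review is whether this caching layer pays its way now that a lookup is a hash probe rather than a file read; the model only shows what behavior would be lost if it were removed.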

> > /*
> > * pgstat_flush_stat: Flushes table stats out to shared statistics.
> > *
>
> Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
> or such? Given that the code for dealing with an individual table's
> entry is named pgstat_flush_tabstat() that's very confusing.

[039]
Definitely. The names were changed while addressing [041].

> > static bool
> > pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
...
> It seems odd that there's a tabstat specific code in pgstat_flush_stat
> (also note singular while it's processing all stats, whereas you're
> below treating pgstat_flush_tabstat as only affecting one table).

[039]
The names were changed while addressing [041].

> > for (i = 0; i < tsa->tsa_used; i++)
> > {
> > PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> >
<many TableStatsArray code>
> > hash_entry->tsa_entry = entry;
> > dest_elem++;
> > }
>
> This seems like too much code. Why is this entirely different from the
> way funcstats works? The difference was already too big before, but this
> made it *way* worse.

[040]
We don't flush stats until the transaction ends. So this description
of TabStatusArray is stale:

* NOTE: once allocated, TabStatusArray structures are never moved or deleted
* for the life of the backend. Also, we zero out the t_id fields of the
* contained PgStat_TableStatus structs whenever they are not actively in use.
* This allows relcache pgstat_info pointers to be treated as long-lived data,
* avoiding repeated searches in pgstat_initstats() when a relation is
* repeatedly opened during a transaction.
(continued to below)

> One goal of this project, as I understand it, is to make it easier to
> add additional stats. As is, this seems to make it harder from the code
> level.

Indeed. I removed the TabStatsArray. Though it is said to live a long
life, its life lasts at most until transaction end. I used the
dynahash entry itself as the pgstat_info entry. One tricky part is
that I had to clear entry->t_id after removal of the entry so that
pgstat_initstats can detect the removal. It is actually safe, but we
could add another table-id member to the struct for that purpose.
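The t_id trick described above can be sketched in isolation (a toy model with invented names; the real code involves the relcache and dynahash): the relation caches a pointer to its stats entry, and zeroing t_id on removal is what lets the next lookup notice the cached pointer has gone stale.

```c
#include <stddef.h>

typedef unsigned int Oid;
typedef struct { Oid t_id; int counts; } TabStatus;
typedef struct { Oid oid; TabStatus *pgstat_info; } Relation;

#define NSLOTS 8
static TabStatus slots[NSLOTS];     /* stands in for the dynahash */

static TabStatus *lookup_or_create(Oid oid)
{
    for (int i = 0; i < NSLOTS; i++)
        if (slots[i].t_id == oid)
            return &slots[i];
    for (int i = 0; i < NSLOTS; i++)
        if (slots[i].t_id == 0)     /* free slot */
        {
            slots[i].t_id = oid;
            slots[i].counts = 0;
            return &slots[i];
        }
    return NULL;
}

/* pgstat_initstats analogue: trust the cached pointer only while its
 * t_id still matches the relation */
static void initstats(Relation *rel)
{
    if (rel->pgstat_info && rel->pgstat_info->t_id == rel->oid)
        return;
    rel->pgstat_info = lookup_or_create(rel->oid);
}

/* flush-time removal: zero t_id so stale cached pointers are detected */
static void remove_entry(TabStatus *e) { e->t_id = 0; }

static int demo(void)
{
    Relation rel = { 1234, NULL };
    TabStatus *first;

    initstats(&rel);
    first = rel.pgstat_info;
    first->counts = 5;
    remove_entry(first);           /* entry flushed and dropped */
    initstats(&rel);               /* stale pointer detected, slot reused */
    return (rel.pgstat_info->t_id == 1234) && (rel.pgstat_info->counts == 0);
}
```

The safety argument is the same as in the reply: a zeroed t_id can never equal a live relation OID, so a stale pointer is always re-resolved rather than reused.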

> > bool
> > pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> > PgStat_TableStatus *entry)
> > {
> > Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> > int table_mode = PGSTAT_EXCLUSIVE;
> > bool updated = false;
> > dshash_table *tabhash;
> > PgStat_StatDBEntry *dbent;
> > int generation;
> >
> > if (nowait)
> > table_mode |= PGSTAT_NOWAIT;
> >
> > /* Attach required table hash if not yet. */
> > if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> > {
> > /*
> > * Return if we don't have corresponding dbentry. It would've been
> > * removed.
> > */
> > dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> > if (!dbent)
> > return false;
> >
> > /*
> > * We don't hold lock on the dbentry since it cannot be dropped while
> > * we are working on it.
> > */
> > generation = pin_hashes(dbent);
> > tabhash = attach_table_hash(dbent, generation);
>
> This again is just cost incurred by insisting on destroying hashtables
> instead of keeping them around as long as necessary.

[040]
Maybe you mean the reverse? The pin_hashes complexity is left in this
version. -> [033]

> > /*
> > * Local table stats should be applied to both dbentry and tabentry at
> > * once. Update dbentry only if we could update tabentry.
> > */
> > if (pgstat_update_tabentry(tabhash, entry, nowait))
> > {
> > pgstat_update_dbentry(dbent, entry);
> > updated = true;
> > }
>
> At this point we're very deeply nested. pgstat_report_stat() ->
> pgstat_flush_stat() -> pgstat_flush_tabstat() ->
> pgstat_update_tabentry().
>
> That's way over the top imo.

[041] (Fixed) (Related to [036])
Completely agree. It is a result of my wanting to avoid scanning
pgStatTables twice.
(continued)

> I don't think it makes much sense that pgstat_update_dbentry() is called
> separately for each table. Why would we want to constantly lock that
> entry? It seems to be much more sensible to instead have
> pgstat_flush_stat() transfer the stats it reported to the pending
> database wide counters, and then report that to shared memory *once* per
> pgstat_report_stat() with pgstat_flush_dbstats()?

In the attached version it scans PgStat_StatDBEntry twice: once for
the tables of the current database and once for shared tables. That
change simplified the surrounding logic.

pgstat_report_stat()
pgstat_flush_tabstats(<tables of current database>)
pgstat_update_tabentry() (at bottom)

> > LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
> > if (HAVE_PENDING_CONFLICTS())
> > pgstat_flush_recovery_conflict(cxt->mydbentry);
> > if (BeDBStats.n_deadlocks != 0)
> > pgstat_flush_deadlock(cxt->mydbentry);
..
> What's the point of having all these sub-functions? I see that you, for
> an undocumented reason, have pgstat_report_recovery_conflict() flush
> conflict stats immediately:

[042]
Fixed by [032].

> > dbentry = pgstat_get_db_entry(MyDatabaseId,
> > PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> > &status);
> >
> > if (status == LOCK_FAILED)
> > return;
> >
> > /* We had a chance to flush immediately */
> > pgstat_flush_recovery_conflict(dbentry);
> >
> > dshash_release_lock(pgStatDBHash, dbentry);
>
> But I don't understand why? Nor why we'd not just report all pending
> database wide changes in that case?
>
> The fact that you're locking the per-database entry unconditionally once
> for each table almost guarantees contention - and you're not using the
> 'conditional lock' approach for that. I don't understand.

[043] (Maybe fixed) (Related to [045].)
Vacuum, analyze, DROP DB and reset cannot be delayed, so the
conditional lock is mainly used by
pgstat_report_stat(). dshash_find_or_insert didn't allow a shared
lock, so I changed dshash_find_extended to allow a shared lock even
when it is told to create a missing entry. Although it takes an
exclusive lock at the moment of entry creation, in most cases the
exclusive lock is not needed. This allows the use of a shared lock
while processing vacuum or analyze stats.

Previously I thought that we could work on a shared database entry
while no lock is held, but there are actually cases where insertion of
a new database entry causes a rehash (resize). That operation moves
entries, so we need at least a shared lock on the database entry while
we are working on it. So in the attached version most operations work
through the following steps.

- get shared database entry with shared lock
- attach table/function hash
- fetch an entry with exclusive lock
- update entry
- release the table/function entry
- detach table/function hash

if needed
- take LW_EXCLUSIVE on database entry
- update database numbers
- release LWLock
- release shared database entry

> > pgstat_vacuum_stat(void)
> > {
...
> > dshash_seq_init(&dshstat, pgStatDBHash, false, true);
> > while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
> > pgstat_drop_database(dbid);
..
> So, uh, pgstat_drop_database() again does a *separate* lookup in the
> dshash, locking the entry. Which only works because you added this dirty
> hack:
>
> /* We need to keep partition lock while sequential scan */
> if (!hash_table->seqscan_running)
> {
> hash_table->find_locked = false;
> hash_table->find_exclusively_locked = false;
> LWLockRelease(PARTITION_LOCK(hash_table, partition));
> }
>
> to dshash_delete_entry(). This seems insane to me. There's not even a
> comment explaining this?

[044]

Following [001] and [009], I added
dshash_delete_current(). pgstat_vacuum_stat() uses it instead of
dshash_delete_entry(). The hack is gone.

(pgstat_vacuum_stat(void))
> > }
> > dshash_release_lock(pgStatDBHash, dbentry);
>
> Wait, why are we holding the database partition lock across all this?
> Again without any comments explaining why?

[045] (I'm not sure it is fixed)
The lock is a shared lock in the current version. The database entry
is needed only for attaching the table hash, and the hashes are no
longer removed. So, as you perhaps suggested, the lock can be released
earlier in:

pgstat_report_stat()
pgstat_flush_funcstats()
pgstat_vacuum_stat()
pgstat_reset_single_counter()
pgstat_report_vacuum()
pgstat_report_analyze()

The following functions work on the database entry, so the lock needs
to be retained until the end of their work.

pgstat_flush_dbstats()
pgstat_drop_database() /* needs exclusive lock */
pgstat_reset_counters()
pgstat_report_autovac()
pgstat_report_recovery_conflict()
pgstat_report_deadlock()
pgstat_report_tempfile()
pgstat_report_checksum_failures_in_db()
pgstat_flush_checksum_failure() /* repeatedly takes a short-term lock on each db */

> > +void
> > +pgstat_send_archiver(const char *xlog, bool failed)
>
> Why do we still have functions named pgstat_send*?

[046] (Fixed)
Same as [029]; I changed it to pgstat_report_archiver().

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v25-0001-Use-standard-crash-handler-in-archiver.patch text/x-patch 1.7 KB
v25-0002-sequential-scan-for-dshash.patch text/x-patch 9.9 KB
v25-0003-Add-conditional-lock-feature-to-dshash.patch text/x-patch 6.8 KB
v25-0004-Make-archiver-process-an-auxiliary-process.patch text/x-patch 11.0 KB
v25-0005-Use-latch-instead-of-SIGUSR1-to-wake-up-archiver.patch text/x-patch 7.0 KB
v25-0006-Shared-memory-based-stats-collector.patch text/x-patch 220.8 KB
v25-0007-Doc-part-of-shared-memory-based-stats-collector.patch text/x-patch 21.5 KB
v25-0008-Remove-the-GUC-stats_temp_directory.patch text/x-patch 9.6 KB
