Re: shared-memory based stats collector

From: Andres Freund <andres(at)anarazel(dot)de>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: michael(at)paquier(dot)xyz, alvherre(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, a(dot)zakirov(at)postgrespro(dot)ru, ah(at)cybertec(dot)at, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2020-03-19 19:54:10
Message-ID: 20200319195410.icib45bbgjwqb5zn@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
> > I think we also can get rid of the dshash_delete changes, by instead
> > adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> > or such.
>
> [009] (Fixed)
> I'm not sure about the point of having two interfaces that are hard to
> distinguish. Maybe dshash_delete_current(dshash_seq_stat *status) is
> enough(). I also reverted the dshash_delete().

Well, dshash_delete() cannot generally safely be used together with
iteration. It has to be the current element etc. And I think the locking
changes make dshash less robust. By explicitly tying "delete the current
element" to the iterator, most of that can be avoided.

> > > /* SIGUSR1 signal handler for archiver process */
> >
> > Hm - this currently doesn't set up a correct sigusr1 handler for a
> > shared memory backend - needs to invoke procsignal_sigusr1_handler
> > somewhere.
> >
> > We can probably just convert to using normal latches here, and remove
> > the current 'wakened' logic? That'll remove the indirection via
> > postmaster too, which is nice.
>
> [018] (Fixed, separate patch 0005)
> It seems better. I added it as a separate patch just after the patch
> that turns archiver an auxiliary process.

I don't think it's correct to do it separately, but I can just merge
that on commit.

> > > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> > >
> > > switch (backendType)
> > > {
> > > + case B_ARCHIVER:
> > > + backendDesc = "archiver";
> > > + break;
> >
> > should imo include 'WAL' or such.
>
> [019] (Not Fixed)
> It is already named "archiver" by 8e8a0becb3. Do I rename it in this
> patch set?

Oh. No, don't rename it as part of this. Could you reply to the thread
in which Peter made that change, and reference this complaint?

> [021] (Fixed, separate patch 0007)
> However the "statistics collector process" is gone, I'm not sure
> "statistics collector" feature also is gone. But actually the word
> "collector" looks a bit odd in some context. I replaced "the results
> of statistics collector" with "the activity statistics". (I'm not sure
> "the activity statistics" is proper as a subsystem name.) The word
> "collect" is replaced with "track". I didn't change section IDs
> corresponding to the renaming so that old links can work. I also fixed
> the tranche name for LWTRANCHE_STATS from "activity stats" to
> "activity_statistics"

Without having gone through the changes, that sounds like the correct
direction to me. There's no "collector" anymore, so removing that seems
like the right thing.

> > > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > > index ca5c6376e5..1ffe073a1f 100644
> > > --- a/src/backend/postmaster/pgstat.c
> > > +++ b/src/backend/postmaster/pgstat.c
> > > + * Collects per-table and per-function usage statistics of all backends on
> > > + * shared memory. pg_count_*() and friends are the interface to locally store
> > > + * backend activities during a transaction. Then pgstat_flush_stat() is called
> > > + * at the end of a transaction to pulish the local stats on shared memory.
> > > *
> >
> > I'd rather not exhaustively list the different objects this handles -
> > it'll either be annoying to maintain, or just get out of date.
>
> [024] (Fixed, Maybe)
> Although not sure I get you correctly, I rewrote it as the follows.
>
> * Collects per-table and per-function usage statistics of all backends on
> * shared memory. The activity numbers are once stored locally, then written
> * to shared memory at commit time or by idle-timeout.

s/backends on/backends in/

I was thinking of something like:
* Collects activity statistics, e.g. per-table access statistics, of
* all backends in shared memory. The activity numbers are first stored
* locally in each process, then flushed to shared memory at commit
* time or by idle-timeout.

> > > - * - Add some automatic call for pgstat vacuuming.
> > > + * To avoid congestion on the shared memory, we update shared stats no more
> > > + * often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> > > + * all the local numbers cannot be flushed immediately, we postpone updates
> > > + * and try the next chance after the interval of
> > > + * PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> > > + * PGSTAT_STAT_MAX_INTERVAL(1000ms).
> >
> > I'm not convinced by this backoff logic. The basic interval seems quite
> > high for something going through shared memory, and the max retry seems
> > pretty low.
>
> [025] (Not Fixed)
> Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
> 10000) reasonable?

Partially. I think for access to shared resources we want *increasing*
wait times, rather than shorter retry timeout. The goal should be to be
to make it more likely for all processes to be able to flush their
stats, which can be achieved by flushing less often after hitting
contention.

> > > +/*
> > > + * BgWriter global statistics counters. The name cntains a remnant from the
> > > + * time when the stats collector was a dedicate process, which used sockets to
> > > + * send it.
> > > + */
> > > +PgStat_MsgBgWriter BgWriterStats = {0};
> >
> > I am strongly against keeping the 'Msg' prefix. That seems extremely
> > confusing going forward.
>
> [029] (Fixed) (Related to [046])
> Mmm. It's following your old suggestion to avoid unsubstantial
> diffs. I'm happy to change it. The functions that have "send" in their
> names are for the same reason. I removed the prefix "m_" of the
> members of the struct. (The comment above (with a typo) explains that).

I don't object to having the rename be a separate patch...

> > > + if (StatsShmem->refcount > 0)
> > > + StatsShmem->refcount++;
> >
> > What prevents us from leaking the refcount here? We could e.g. error out
> > while attaching, no? Which'd mean we'd leak the refcount.
>
> [033] (Fixed)
> We don't attach shared stats on postmaster process, so I want to know
> the first attacher process and the last detacher process of shared
> stats. It's not leaks that I'm considering here.
> (continued below)
>
> > To me it looks like there's a lot of added complexity just because you
> > want to be able to reset stats via
> >
> > void
> > pgstat_reset_all(void)
> > {
> >
> > /*
> > * We could directly remove files and recreate the shared memory area. But
> > * detach then attach for simplicity.
> > */
> > pgstat_detach_shared_stats(false); /* Don't write */
> > pgstat_attach_shared_stats();
> >
> > Without that you'd not need the complexity of attaching, detaching to
> > the same degree - every backend could just cache lookup data during
> > initialization, instead of having to constantly re-compute that.
>
> Mmm. I don't get that (or I failed to read clear meaning). The
> function is assumed be called only from StartupXLOG().
> (continued)

Oh? I didn't get that you're only using it for that purpose - there's
very little documentation about what it's trying to do.

I don't see why that means we don't need to accurately track the
refcount? Otherwise we'll forget to write out the stats.

> > Nor would the dynamic re-creation of the db dshash table be needed.
>
> Maybe you are mentioning the complexity of reset_dbentry_counters? It
> is actually complex. Shared stats dshash cannot be destroyed (or
> dshash entry cannot be removed) during someone is working on it. It
> was simpler to wait for another process to end its work but that could
> slow not only the clearing process but also other processes by
> frequent resetting of counters.

I was referring to the fact that the last version of the patch
attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
attach_table_hash, attach_function_hash etc.

> After some thoughts, I decided to rip the all "generation" stuff off
> and it gets far simpler. But counter reset may conflict with other
> backends with a litter higher degree because counter reset needs
> exclusive lock.

That seems harmless to me - stats reset should never happen at a high
enough frequency to make contention it causes problematic. There's also
an argument to be made that it makes sense for the reset to be atomic.

> > > + /* Flush out table stats */
> > > + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > > + pending_stats = true;
> > > +
> > > + /* Flush out function stats */
> > > + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > > + pending_stats = true;
> >
> > This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> > on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> > being confusing while reading the code, it also made the diff much
> > harder to read.
>
> [035] (Maybe Fixed)
> Is the question that, is there any case where
> pgstat_flush_stat/functions leaves some counters unflushed?

No, the point is that there's knowledge about
pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
namely the pgStatTabList, pgStatFunctions lists.

> > Why do we still have this? A hashtable lookup is cheap, compared to
> > fetching a file - so it's not to save time. Given how infrequent the
> > pgstat_fetch_* calls are, it's not to avoid contention either.
> >
> > At first one could think it's for consistency - but no, that's not it
> > either, because snapshot_statentry() refetches the snapshot without
> > control from the outside:
>
> [038]
> I don't get the second paragraph. When the function re*create*s a
> snapshot without control from the outside? It keeps snapshots during a
> transaction. If not, it is broken.
> (continued)

Maybe I just misunderstood the code flow - partially due to the global
clear_snapshot variable. I just had read the
+ * We don't want so frequent update of stats snapshot. Keep it at least
+ * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.

comment, and took it to mean that you're unconditionally updating the
snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
have consistent snapshot across all fetches.

(partially this might have been due to the diff:
/*
- * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
- * msec since we last sent one, or the caller wants to force stats out.
+ * We don't want so frequent update of stats snapshot. Keep it at least
+ * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
*/
- now = GetCurrentTransactionStopTimestamp();
- if (!force &&
- !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
- return;
- last_report = now;
+ if (clear_snapshot)
+ {
+ clear_snapshot = false;
+
+ if (pgStatSnapshotContext &&
)

But I think my question remains: Why do we need the whole snapshot thing
now? Previously we needed to avoid reading a potentially large file -
but that's not a concern anymore?

> > > /*
> > > * We don't want so frequent update of stats snapshot. Keep it at least
> > > * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> > > */
> ...
> > I think we should just remove this entire local caching snapshot layer
> > for lookups.
>
> Currently the behavior is documented as the follows and it seems reasonable.
>
> Another important point is that when a server process is asked to display
> any of these statistics, it first fetches the most recent report emitted by
> the collector process and then continues to use this snapshot for all
> statistical views and functions until the end of its current transaction.
> So the statistics will show static information as long as you continue the
> current transaction. Similarly, information about the current queries of
> all sessions is collected when any such information is first requested
> within a transaction, and the same information will be displayed throughout
> the transaction.
> This is a feature, not a bug, because it allows you to perform several
> queries on the statistics and correlate the results without worrying that
> the numbers are changing underneath you. But if you want to see new
> results with each query, be sure to do the queries outside any transaction
> block. Alternatively, you can invoke
> <function>pg_stat_clear_snapshot</function>(), which will discard the
> current transaction's statistics snapshot (if any). The next use of
> statistical information will cause a new snapshot to be fetched.

I am very unconvinded this is worth the cost. Especially because plenty
of other stats related parts of the system do *NOT* behave this way. How
is a user supposed to understand that pg_stat_database behaves one way,
pg_stat_activity, another, pg_stat_statements a third,
pg_stat_progress_* ...

Perhaps it's best to not touch the semantics here, but I'm also very
wary of introducing significant complications and overhead just to have
this "feature".

> > > for (i = 0; i < tsa->tsa_used; i++)
> > > {
> > > PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> > >
> <many TableStatsArray code>
> > > hash_entry->tsa_entry = entry;
> > > dest_elem++;
> > > }
> >
> > This seems like too much code. Why is this entirely different from the
> > way funcstats works? The difference was already too big before, but this
> > made it *way* worse.
>
> [040]
> We don't flush stats until transaction ends. So the description about
> TabStatuArray is stale?

How is your comment related to my comment above?

> > > bool
> > > pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> > > PgStat_TableStatus *entry)
> > > {
> > > Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> > > int table_mode = PGSTAT_EXCLUSIVE;
> > > bool updated = false;
> > > dshash_table *tabhash;
> > > PgStat_StatDBEntry *dbent;
> > > int generation;
> > >
> > > if (nowait)
> > > table_mode |= PGSTAT_NOWAIT;
> > >
> > > /* Attach required table hash if not yet. */
> > > if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> > > {
> > > /*
> > > * Return if we don't have corresponding dbentry. It would've been
> > > * removed.
> > > */
> > > dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> > > if (!dbent)
> > > return false;
> > >
> > > /*
> > > * We don't hold lock on the dbentry since it cannot be dropped while
> > > * we are working on it.
> > > */
> > > generation = pin_hashes(dbent);
> > > tabhash = attach_table_hash(dbent, generation);
> >
> > This again is just cost incurred by insisting on destroying hashtables
> > instead of keeping them around as long as necessary.
>
> [040]
> Maybe you are insisting the reverse? The pin_hash complexity is left
> in this version. -> [033]

What do you mean? What I'm saying is that we should never end up in a
situation where there's no pgstat entry for the current database. And
that that's trivial, as long as we don't drop the hashtable, but instead
reset counters to 0.

> > > dbentry = pgstat_get_db_entry(MyDatabaseId,
> > > PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> > > &status);
> > >
> > > if (status == LOCK_FAILED)
> > > return;
> > >
> > > /* We had a chance to flush immediately */
> > > pgstat_flush_recovery_conflict(dbentry);
> > >
> > > dshash_release_lock(pgStatDBHash, dbentry);
> >
> > But I don't understand why? Nor why we'd not just report all pending
> > database wide changes in that case?
> >
> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
>
> [043] (Maybe fixed) (Related to [045].)
> Vacuum, analyze, DROP DB and reset cannot be delayed. So the
> conditional lock is mainly used by
> pgstat_report_stat().

You're saying "cannot be delayed" - but you're not explaining *why* that
is.

Even if true, I don't see why that necessitates doing the flushing and
locking once for each of these functions?

> dshash_find_or_insert didn't allow shared lock. I changed
> dshash_find_extended to allow shared-lock even if it is told to create
> a missing entry. Alrhough it takes exclusive lock at the mement of
> entry creation, most of all cases it doesn't need exclusive lock. This
> allows use shared lock while processing vacuum or analyze stats.

Huh?

> Previously I thought that we can work on a shared database entry while
> lock is not held, but actually there are cases where insertion of a
> new database entry causes rehash (resize). The operation moves entries
> so we need at least shared lock on database entry while we are working
> on it. So in the attched basically most operations are working by the
> following steps.
> - get shared database entry with shared lock
> - attach table/function hash
> - fetch an entry with exclusive lock
> - update entry
> - release the table/function entry
> - detach table/function hash
> if needed
> - take LW_EXCLUSIVE on database entry
> - update database numbers
> - release LWLock
> - release shared database entry

Just to be crystal clear: I am exceedingly unlikely to commit this with
any sort of short term attach/detach operations. Both because of the
runtime overhead/contention it causes is significant, and because of the
code complexity implied by it.

Leaving attach/detach aside: I think it's a complete no-go to acquire
database wide locks at this frequency, and then to hold them over other
operations that are a) not cheap b) can block. The contention due to
that would be *terrible* for scalability, even if it's just a shared
lock.

The way this *should* work is that:
1.1) At backend startup, attach to the database wide hashtable
1.2) At backend startup, attach to the various per-database hashtables
(including ones for shared tables)
2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
3.1) When shutting down backend, detach from all hashtables

That way we never need to hold onto the database-wide hashtables for
long, and we can do it with conditional locks (trylock above), unless we
need to force flushing.

It might be worthwhile to merge per-table, per-function, per-database
hashes into a single hash. Where the key is either something like
{hashkind, objoid} (referenced from a per-database hashtable), or even
{hashkind, dboid, objoid} (one global hashtable).

I think the contents of the hashtable should likely just be a single
dsa_pointer (plus some bookkeeping). Several reasons for that:

1) Since one goal of this is to make the stats system more extensible,
it seems important that we can make the set of stats kept
runtime configurable. Otherwise everyone will continue to have to pay
the price for every potential stat that we have an option to track.

2) Having hashtable resizes move fairly large stat entries around is
expensive. Whereas just moving key + dsa_pointer around is pretty
cheap. I don't think the cost of a pointer dereference matters in
*this* case.

3) If the stats contents aren't moved around, there's no need to worry
about hashtable resizes. Therefore the stats can be referenced
without holding dshash partition locks.

4) If the stats entries aren't moved around by hashtable resizes, we can
use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
not generally correct/safe to have dshash resize to move those
around.

All of that would be addressed if we instead allocate the stats data
separately from the dshash entry.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2020-03-19 19:56:59 Re: Make MemoryContextMemAllocated() more precise
Previous Message Dean Rasheed 2020-03-19 19:53:39 Re: PATCH: add support for IN and @> in functional-dependency statistics use