Re: shared-memory based stats collector

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: andres(at)anarazel(dot)de
Cc: michael(at)paquier(dot)xyz, alvherre(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, a(dot)zakirov(at)postgrespro(dot)ru, ah(at)cybertec(dot)at, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2020-03-27 07:31:15
Message-ID: 20200327.163115.1814716981588442968.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Hello.

At Thu, 19 Mar 2020 12:54:10 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in
> Hi,
>
> On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
> > > I think we also can get rid of the dshash_delete changes, by instead
> > > adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> > > or such.
> >
> > [009] (Fixed)
> > I'm not sure about the point of having two interfaces that are hard to
> > distinguish. Maybe dshash_delete_current(dshash_seq_stat *status) is
> > enough. I also reverted the dshash_delete().
>
> Well, dshash_delete() cannot generally safely be used together with
> iteration. It has to be the current element etc. And I think the locking
> changes make dshash less robust. By explicitly tying "delete the current
> element" to the iterator, most of that can be avoided.

Sure. By the way, I forgot to remove the seqscan_running stuff; it is removed now.
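
For the record, a delete-during-scan using that interface would look
roughly like the following (a sketch based on the 0002/0003 patches;
the exact signatures may differ, and entry_is_dead is a hypothetical
predicate):

dshash_seq_status status;
PgStatHashEntry *ent;

/* start a sequential scan, taking partition locks exclusively */
dshash_seq_init(&status, pgStatSharedHash, true);
while ((ent = dshash_seq_next(&status)) != NULL)
{
    if (entry_is_dead(ent))             /* hypothetical predicate */
        dshash_delete_current(&status); /* safe: tied to the iterator */
}
dshash_seq_term(&status);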

> > > > /* SIGUSR1 signal handler for archiver process */
> > >
> > > Hm - this currently doesn't set up a correct sigusr1 handler for a
> > > shared memory backend - needs to invoke procsignal_sigusr1_handler
> > > somewhere.
> > >
> > > We can probably just convert to using normal latches here, and remove
> > > the current 'wakened' logic? That'll remove the indirection via
> > > postmaster too, which is nice.
> >
> > [018] (Fixed, separate patch 0005)
> > It seems better. I added it as a separate patch just after the patch
> > that turns archiver an auxiliary process.
>
> I don't think it's correct to do it separately, but I can just merge
> that on commit.

Yes, it's just for the convenience of reviewing. Merged.

> > > > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> > > >
> > > > switch (backendType)
> > > > {
> > > > + case B_ARCHIVER:
> > > > + backendDesc = "archiver";
> > > > + break;
> > >
> > > should imo include 'WAL' or such.
> >
> > [019] (Not Fixed)
> > It is already named "archiver" by 8e8a0becb3. Do I rename it in this
> > patch set?
>
> Oh. No, don't rename it as part of this. Could you reply to the thread
> in which Peter made that change, and reference this complaint?

I sent a mail to that effect:

https://www.postgresql.org/message-id/20200327.163007.128069746774242774.horikyota.ntt%40gmail.com

> > [021] (Fixed, separate patch 0007)
> > Now that the "statistics collector process" is gone, I'm not sure the
> > "statistics collector" feature is also gone. But actually the word
> > "collector" looks a bit odd in some contexts. I replaced "the results
> > of the statistics collector" with "the activity statistics". (I'm not sure
> > "the activity statistics" is proper as a subsystem name.) The word
> > "collect" is replaced with "track". I didn't change the section IDs
> > corresponding to the renaming so that old links keep working. I also fixed
> > the tranche name for LWTRANCHE_STATS from "activity stats" to
> > "activity_statistics".
>
> Without having gone through the changes, that sounds like the correct
> direction to me. There's no "collector" anymore, so removing that seems
> like the right thing.

Thanks.

> > > > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > > > index ca5c6376e5..1ffe073a1f 100644
> > > > --- a/src/backend/postmaster/pgstat.c
> > > > +++ b/src/backend/postmaster/pgstat.c
...
> > [024] (Fixed, Maybe)
> > Although not sure I get you correctly, I rewrote it as the follows.
..
> I was thinking of something like:
> * Collects activity statistics, e.g. per-table access statistics, of
> * all backends in shared memory. The activity numbers are first stored
> * locally in each process, then flushed to shared memory at commit
> * time or by idle-timeout.

Looks fine. Replaced it with the above.

> > [025] (Not Fixed)
> > Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
> > 10000) reasonable?
>
> Partially. I think for access to shared resources we want *increasing*
> wait times, rather than shorter retry timeout. The goal should be to be
> to make it more likely for all processes to be able to flush their
> stats, which can be achieved by flushing less often after hitting
> contention.

Ah! Indeed. The attached version works as follows:

* To avoid congestion on the shared memory, shared stats are updated no more
* often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
* remain unflushed due to lock failure, we retry at intervals that start at
* PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
* force an update after PGSTAT_MAX_INTERVAL (10000ms) from the first attempt.

Concretely, the interval changes as follows:

   elapsed      interval
 ------------+--------------
      0ms      (1000ms)
   1000ms        250ms
   1250ms        500ms
   1750ms       1000ms
   2750ms       2000ms
   4750ms       5250ms (not 4000ms, so the last flush lands on 10000ms)
  10000ms      (forced flush)
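
One way to produce that schedule (my reading of the numbers above; the
function and variable names here are illustrative, not the patch's) is
to double the interval on each retry, but stretch the last wait so the
forced flush lands exactly on PGSTAT_MAX_INTERVAL:

static long
next_retry_interval(long elapsed, long interval)
{
    interval *= 2;

    /*
     * If the retry after this one would already run past
     * PGSTAT_MAX_INTERVAL, stretch this wait so that we flush exactly
     * at PGSTAT_MAX_INTERVAL instead of scheduling another retry.
     */
    if (elapsed + interval + interval * 2 > PGSTAT_MAX_INTERVAL)
        interval = PGSTAT_MAX_INTERVAL - elapsed;

    return interval;
}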

While fixing that, I also fixed several silly bugs:
- pgstat_report_stat accessed dbent even when it is NULL.
- pgstat_flush_tabstats set have_(sh|my)database_stats wrongly.

> > [029] (Fixed) (Related to [046])
> > Mmm. It's following your old suggestion to avoid unsubstantial
> > diffs. I'm happy to change it. The functions that have "send" in their
> > names are for the same reason. I removed the prefix "m_" of the
> > members of the struct. (The comment above (with a typo) explains that).
>
> I don't object to having the rename be a separate patch...

Nope. I don't want to make it a separate patch.

> > > > + if (StatsShmem->refcount > 0)
> > > > + StatsShmem->refcount++;
> > >
> > > What prevents us from leaking the refcount here? We could e.g. error out
> > > while attaching, no? Which'd mean we'd leak the refcount.
> >
> > [033] (Fixed)
> > We don't attach shared stats on postmaster process, so I want to know
> > the first attacher process and the last detacher process of shared
> > stats. It's not leaks that I'm considering here.
> > (continued below)
> >
> > > To me it looks like there's a lot of added complexity just because you
> > > want to be able to reset stats via
...
> > > Without that you'd not need the complexity of attaching, detaching to
> > > the same degree - every backend could just cache lookup data during
> > > initialization, instead of having to constantly re-compute that.
> >
> > Mmm. I don't get that (or I failed to read clear meaning). The
> > function is assumed be called only from StartupXLOG().
> > (continued)
>
> Oh? I didn't get that you're only using it for that purpose - there's
> very little documentation about what it's trying to do.

Ugg..

> I don't see why that means we don't need to accurately track the
> refcount? Otherwise we'll forget to write out the stats.

Exactly, and I added comments for that.

| * refcount is used to know whether a process detaching from shared stats is
| * the last process or not. The last process writes out the stats files.
| */
| typedef struct StatsShmemStruct

| if (--StatsShmem->refcount < 1)
| {
| /*
| * The process is the last one that was attached to the shared stats
| * memory. Write out the stats files if requested.
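
So attach/detach boils down to something like the following sketch
(StatsLock and the function names are placeholders, not the patch's
code):

static void
attach_shared_stats(void)
{
    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
    if (StatsShmem->refcount++ == 0)
    {
        /* the first attacher restores stats from the saved files */
    }
    LWLockRelease(StatsLock);
}

static void
detach_shared_stats(bool write_stats)
{
    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
    if (--StatsShmem->refcount < 1 && write_stats)
    {
        /* the last detacher writes out the stats files */
    }
    LWLockRelease(StatsLock);
}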

> > > Nor would the dynamic re-creation of the db dshash table be needed.
..
> I was referring to the fact that the last version of the patch
> attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
> attach_table_hash, attach_function_hash etc.

The pin/unpin stuff is gone. Now there is only one dshash, and it stays
attached for the lifetime of the process.

> > After some thought, I decided to rip all the "generation" stuff out,
> > and it gets far simpler. But counter reset may conflict with other
> > backends to a slightly higher degree, because counter reset needs an
> > exclusive lock.
>
> That seems harmless to me - stats reset should never happen at a high
> enough frequency to make contention it causes problematic. There's also
> an argument to be made that it makes sense for the reset to be atomic.

Agreed.

> > > > + /* Flush out table stats */
> > > > + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > > > + pending_stats = true;
> > > > +
> > > > + /* Flush out function stats */
> > > > + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > > > + pending_stats = true;
> > >
> > > This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> > > on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> > > being confusing while reading the code, it also made the diff much
> > > harder to read.
> >
> > [035] (Maybe Fixed)
> > Is the question that, is there any case where
> > pgstat_flush_stat/functions leaves some counters unflushed?
>
> No, the point is that there's knowledge about
> pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
> namely the pgStatTabList, pgStatFunctions lists.

Mmm. Anyway, that code has been largely rewritten in this version.

> > > Why do we still have this? A hashtable lookup is cheap, compared to
> > > fetching a file - so it's not to save time. Given how infrequent the
> > > pgstat_fetch_* calls are, it's not to avoid contention either.
> > >
> > > At first one could think it's for consistency - but no, that's not it
> > > either, because snapshot_statentry() refetches the snapshot without
> > > control from the outside:
> >
> > [038]
> > I don't get the second paragraph. When the function re*create*s a
> > snapshot without control from the outside? It keeps snapshots during a
> > transaction. If not, it is broken.
> > (continued)
>
> Maybe I just misunderstood the code flow - partially due to the global
> clear_snapshot variable. I just had read the
> + * We don't want so frequent update of stats snapshot. Keep it at least
> + * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
>
> comment, and took it to mean that you're unconditionally updating the
> snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
> have consistent snapshot across all fetches.
>
> (partially this might have been due to the diff:
> /*
> - * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> - * msec since we last sent one, or the caller wants to force stats out.
> + * We don't want so frequent update of stats snapshot. Keep it at least
> + * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> */

Wow.. I tried "git config --global diff.algorithm patience" and it
seems to work well.

> But I think my question remains: Why do we need the whole snapshot thing
> now? Previously we needed to avoid reading a potentially large file -
> but that's not a concern anymore?
...
> > Currently the behavior is documented as the follows and it seems reasonable.
> >
...
> I am very unconvinced this is worth the cost. Especially because plenty
> of other stats related parts of the system do *NOT* behave this way. How
> is a user supposed to understand that pg_stat_database behaves one way,
> pg_stat_activity, another, pg_stat_statements a third,
> pg_stat_progress_* ...
>
> Perhaps it's best to not touch the semantics here, but I'm also very
> wary of introducing significant complications and overhead just to have
> this "feature".

As a compromise, I removed the "clear_snapshot" stuff. Snapshots still
work, but now clear_snapshot() clears them immediately. That is the
same way pg_stat_activity works.

> > > > for (i = 0; i < tsa->tsa_used; i++)
> > > > {
> > > > PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> > > >
> > <many TableStatsArray code>
> > > > hash_entry->tsa_entry = entry;
> > > > dest_elem++;
> > > > }
> > >
> > > This seems like too much code. Why is this entirely different from the
> > > way funcstats works? The difference was already too big before, but this
> > > made it *way* worse.
> >
> > [040]
> > We don't flush stats until the transaction ends. So the description about
> > TabStatusArray is stale?
>
> How is your comment related to my comment above?

Hmm. It looks like my reply was truncated. TableStatsArray is removed;
all kinds of local stats (except global stats) are now stored directly
in pgStatLocalHashEntry. The code gets far simpler.

> > > > generation = pin_hashes(dbent);
> > > > tabhash = attach_table_hash(dbent, generation);
> > >
> > > This again is just cost incurred by insisting on destroying hashtables
> > > instead of keeping them around as long as necessary.
> >
> > [040]
> > Maybe you are insisting the reverse? The pin_hash complexity is left
> > in this version. -> [033]
>
> What do you mean? What I'm saying is that we should never end up in a
> situation where there's no pgstat entry for the current database. And
> that that's trivial, as long as we don't drop the hashtable, but instead
> reset counters to 0.

In the previous version (which was not sent to the ML), attach/detach
happened only at process start/end. But in this version the
table/function dshashes are gone entirely.

> > > The fact that you're locking the per-database entry unconditionally once
> > > for each table almost guarantees contention - and you're not using the
> > > 'conditional lock' approach for that. I don't understand.
> >
> > [043] (Maybe fixed) (Related to [045].)
> > Vacuum, analyze, DROP DB and reset cannot be delayed. So the
> > conditional lock is mainly used by
> > pgstat_report_stat().
>
> You're saying "cannot be delayed" - but you're not explaining *why* that
> is.
>
> Even if true, I don't see why that necessitates doing the flushing and
> locking once for each of these functions?

Sorry, that was wrong. We can just skip the removal on lock failure
during pgstat_vacuum_stat(); it will be retried the next time. The
other database-wide stats (deadlocks, checksum failures, temp files and
conflicts) are now collected locally and then flushed.
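
For those events the pattern is just a local counter that the flush
path applies under the entry lock, roughly like this (the variable
names are illustrative):

static PgStat_Counter pending_deadlocks = 0;

void
pgstat_report_deadlock(void)
{
    pending_deadlocks++;            /* no shared lock taken here */
}

/* later, in the flush path, with the database entry's lock held: */
shared_dbent->n_deadlocks += pending_deadlocks;
pending_deadlocks = 0;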

> > dshash_find_or_insert didn't allow a shared lock. I changed
> > dshash_find_extended to allow a shared lock even if it is told to create
> > a missing entry. Although it takes an exclusive lock at the moment of
> > entry creation, in most cases the exclusive lock is not needed. This
> > allows using a shared lock while processing vacuum or analyze stats.
>
> Huh?

Well, anyway, the shared-insert mode of dshash_find_extended is no
longer needed, so I removed it in this version.

> Just to be crystal clear: I am exceedingly unlikely to commit this with
> any sort of short term attach/detach operations. Both because of the
> runtime overhead/contention it causes is significant, and because of the
> code complexity implied by it.

I think it is addressed in this version.

> Leaving attach/detach aside: I think it's a complete no-go to acquire
> database wide locks at this frequency, and then to hold them over other
> operations that are a) not cheap b) can block. The contention due to
> that would be *terrible* for scalability, even if it's just a shared
> lock.

> The way this *should* work is that:
> 1.1) At backend startup, attach to the database wide hashtable
> 1.2) At backend startup, attach to the various per-database hashtables
> (including ones for shared tables)
> 2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
> 2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
> 2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
> 2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
> 3.1) When shutting down backend, detach from all hashtables
>
>
> That way we never need to hold onto the database-wide hashtables for
> long, and we can do it with conditional locks (trylock above), unless we
> need to force flushing.

I think the attached works in a similar way. Table/function stats are
processed together, then database stats are processed.
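
At the top level the flow is roughly the following (a sketch of the
"havestats && trylock && flush" pattern; the helper names here are
illustrative, not the patch's):

/* sketch of the end-of-transaction path */
if (!have_pending_stats())          /* "havestats" */
    return;

nowait = !force;                    /* "trylock" unless flush is forced */

if (!pgstat_flush_stats(nowait))    /* "flush"; may leave leftovers */
    extend_retry_interval();        /* back off as described in [025] */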

> It might be worthwhile to merge per-table, per-function, per-database
> hashes into a single hash. Where the key is either something like
> {hashkind, objoid} (referenced from a per-database hashtable), or even
> {hashkind, dboid, objoid} (one global hashtable).
>
> I think the contents of the hashtable should likely just be a single
> dsa_pointer (plus some bookkeeping). Several reasons for that:
>
> 1) Since one goal of this is to make the stats system more extensible,
> it seems important that we can make the set of stats kept
> runtime configurable. Otherwise everyone will continue to have to pay
> the price for every potential stat that we have an option to track.
>
> 2) Having hashtable resizes move fairly large stat entries around is
> expensive. Whereas just moving key + dsa_pointer around is pretty
> cheap. I don't think the cost of a pointer dereference matters in
> *this* case.
>
> 3) If the stats contents aren't moved around, there's no need to worry
> about hashtable resizes. Therefore the stats can be referenced
> without holding dshash partition locks.
>
> 4) If the stats entries aren't moved around by hashtable resizes, we can
> use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
> not generally correct/safe to have dshash resize to move those
> around.
>
>
> All of that would be addressed if we instead allocate the stats data
> separately from the dshash entry.

OK, I'm convinced by that (and I like it). The attached v27 is largely
changed from the previous version following the suggestion.

1) DB, table and function stats are stored into one hash keyed by
(type, dbid, objectid) and handled in a unified way. Now
pgstat_report_stat flushes stats the following way:

while (hash_seq_search on local stats hash)
{
    switch (ent->stats->type)
    {
        case PGSTAT_TYPE_DB: ...
        case PGSTAT_TYPE_TABLE: ...
        case PGSTAT_TYPE_FUNCTION: ...
    }
}

2, 3) There's only one dshash table, pgStatSharedHash. Its entry is
defined as follows.

+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;    /* hash key */
+    dsa_pointer        stats;  /* pointer to shared stats entry in DSA */
+} PgStatHashEntry;

The key is (type, databaseid, objectid).
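
(For clarity, the key presumably has this shape; the struct below is my
illustration, not quoted from the patch.)

typedef struct PgStatHashEntryKey
{
    PgStatTypes type;        /* PGSTAT_TYPE_DB/TABLE/FUNCTION */
    Oid         databaseid;  /* InvalidOid for shared objects */
    Oid         objectid;    /* e.g. table or function OID */
} PgStatHashEntryKey;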

To handle entries of different types in a common way, the hash entry
points to the following struct stored in DSA memory.

+typedef struct PgStatEntry
+{
+    PgStatTypes type;   /* statistics entry type */
+    size_t      len;    /* length of body, fixed per type */
+    LWLock      lock;   /* lightweight lock to protect body */
+    char        body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+} PgStatEntry;

The body stores the existing PgStat_Stat*Entry structs.
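
A reader therefore resolves the dsa_pointer and casts the body, along
these lines (an illustrative snippet, not quoted from the patch):

PgStatEntry *shent;
PgStat_StatTabEntry *tabent;

/* the dshash entry stores only a dsa_pointer to the stats body */
shent = (PgStatEntry *) dsa_get_address(area, hashent->stats);
Assert(shent->type == PGSTAT_TYPE_TABLE);
tabent = (PgStat_StatTabEntry *) shent->body;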

To match the shared stats, locally-stored stats entries are changed in
a similar way.

4) As shown above, I'm using LWLock in this version.
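
Since the entries never move once allocated in DSA, the embedded LWLock
can protect the body directly, including the conditional-lock case
inside the flush routine (a sketch; the counter update is just an
example):

if (nowait)
{
    if (!LWLockConditionalAcquire(&shent->lock, LW_EXCLUSIVE))
        return false;           /* keep the entry pending, retry later */
}
else
    LWLockAcquire(&shent->lock, LW_EXCLUSIVE);

tabent->tuples_inserted += local->tuples_inserted;
LWLockRelease(&shent->lock);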

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v27-0001-Use-standard-crash-handler-in-archiver.patch text/x-patch 1.7 KB
v27-0002-sequential-scan-for-dshash.patch text/x-patch 8.3 KB
v27-0003-Add-conditional-lock-feature-to-dshash.patch text/x-patch 6.1 KB
v27-0004-Make-archiver-process-an-auxiliary-process.patch text/x-patch 16.9 KB
v27-0005-Shared-memory-based-stats-collector.patch text/x-patch 227.8 KB
v27-0006-Doc-part-of-shared-memory-based-stats-collector.patch text/x-patch 21.5 KB
v27-0007-Remove-the-GUC-stats_temp_directory.patch text/x-patch 9.6 KB
