Re: shared-memory based stats collector - v66

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: melanieplageman(at)gmail(dot)com
Cc: andres(at)anarazel(dot)de, ibrar(dot)ahmad(at)gmail(dot)com, masao(dot)fujii(at)oss(dot)nttdata(dot)com, gkokolatos(at)protonmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: shared-memory based stats collector - v66
Date: 2022-03-25 08:24:18
Message-ID: 20220325.172418.1055357550509939415.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Fri, 25 Mar 2022 14:22:56 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote in
> > On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > The biggest todos are:
> > > - Address all the remaining AFIXMEs and XXXs
> >
> > Attached is a patch that addresses three of the existing AFIXMEs.

I'd like to dump out my humble thoughts about other AFIXMEs..

> AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> for increasing it?

The comment just above says 1000ms, but the value is actually 10000ms. The
number came from a discussion: if we have 1000 clients and each backend
flushes its stats once per 0.5 seconds, in total we flush pending data to
the shared area 2000 times per second, which is too frequent. I raised it
to 5000ms, then to 10000ms, so the expected maximum flush frequency drops
to 100 times per second. Of course that assumes the worst case, and
10000ms is clearly too long for average cases.
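
Just to spell out the worst-case arithmetic behind those numbers (every
backend having pending stats on every interval):

  1000 backends * (1 flush / 0.5 s) = 2000 flushes/s  (500 ms interval)
  1000 backends * (1 flush / 10 s)  =  100 flushes/s  (10000 ms interval)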

The current implementation of pgstat postpones flushing when a lock
collision happens, by at most 60s. This is a kind of auto-averaging
mechanism. It might be enough by itself, and we could then reduce
PGSTAT_MIN_INTERVAL to 500ms or so.

> AFIXME: architecture explanation.

Mmm. next, please:p

( [PGSTAT_KIND_REPLSLOT] = {)
> * AFIXME: With a bit of extra work this could now be a !fixed_amount
> * stats kind.

Yeah. The most bothersome point is that the slot index is not persistent
at all, and that the relationship between the index and the name (or
identity) is not stable even within a process's lifetime. It could be
resolved by allocating an object id to every replication slot. I faintly
remember a discussion along those lines, but I don't have a clear memory
of it.

> static Size
> pgstat_dsa_init_size(void)
> {
> /*
> * AFIXME: What should we choose as an initial size? Should we make this
> * configurable? Maybe tune based on NBuffers?

> StatsShmemInit(void)
> * AFIXME: we need to guarantee this can be allocated in plain shared
> * memory, rather than allocating dsm segments.

I'm not sure NBuffers is the ideal base for deciding the required size,
since it doesn't seem to be generally proportional to the number of
database objects. If we made it manually tunable, we would be able to
emit a log message, as a tuning aid, whenever a DSM segment is allocated
for this use:

WARNING: dsa allocation happened for activity statistics
HINT: You might want to increase stat_dsa_initial_size if you see slow
down blah..
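
To be concrete about what I mean, just a sketch (the exact call site and
the GUC name "stat_dsa_initial_size" are made up here):

    /* hypothetical: where the stats dsa has to create an extra DSM segment */
    ereport(WARNING,
            (errmsg("dsa allocation happened for activity statistics"),
             errhint("You might want to increase stat_dsa_initial_size.")));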

> * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?

Or pgstat_xact.c?

> * AFIXME: comment
> * AFIXME: see notes about race conditions for functions in
> * pgstat_drop_function().
> */
> void
> pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)

pgstat_drop_function() doesn't seem to have such a note.

I suppose the "race condition" means the case a stats entry for an
object is created just after the same object is dropped on another
backend. It seems to me such a race condition is eliminated by the
transactional drop mechanism. Are you intending to write an
explanation of that?

> /*
> * pgStatSharedRefAge increments quite slowly than the time the following
> * loop takes so this is expected to iterate no more than twice.
> *
> * AFIXME: Why is this a good place to do this?
> */
> while (pgstat_shared_refs_need_gc())
> pgstat_shared_refs_gc();

Is the reason for the AFIXME that you think the GC check happens too
frequently?

> pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
> {
...
> * AFIXME: this probably is racy. Another backend could look up the
> * stat, bump the refcount, as we free it.
> if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
> {
...
> /* only dropped entries can reach a 0 refcount */
> Assert(shared_ref->shared_entry->dropped);

I haven't examined this deeply, but is that race condition avoidable by
preventing pgstat_shared_ref_get from incrementing the refcount of
dropped entries?
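
For example, something along these lines on the lookup side (only a
sketch, not the patch's code; the helper name is invented), so that a
reference can never be taken once the count has dropped to zero:

    #include "postgres.h"
    #include "port/atomics.h"

    /*
     * Take a reference only while the refcount is still above zero, so a
     * concurrently released (dropped) entry cannot be resurrected.
     */
    static bool
    pgstat_shared_ref_try_pin(pg_atomic_uint32 *refcount)
    {
        uint32      old = pg_atomic_read_u32(refcount);

        while (old > 0)
        {
            /* on failure, "old" is updated to the current value; retry */
            if (pg_atomic_compare_exchange_u32(refcount, &old, old + 1))
                return true;
        }

        /* already zero: the entry is being (or has been) freed */
        return false;
    }

pgstat_shared_ref_get() could then treat a false return the same way as a
missing entry.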

> * AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
> * it's not entirely trivial, because we can't use plain dshash_delete_entry()
> * (but have to use dshash_delete_current()).
> */
> static bool
> pgstat_drop_stats_entry(dshash_seq_status *hstat)
...
> * AFIXME: don't do this while holding the dshash lock.

Do the AFIXMEs mean that we should move the call to
pgstat_shared_ref_release() out of the dshash loop (in
pgstat_drop_database_and_contents) that calls this function? Would it be
sensible to store the (key, ref) pairs of the to-be-released shared_refs
and then clean them up after exiting the loop, as in the sketch below?
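
Something like the following is what I have in mind (only a rough sketch;
PendingRelease and the variable names are invented):

    #include "postgres.h"
    #include "nodes/pg_list.h"

    /* invented container for a deferred release */
    typedef struct PendingRelease
    {
        PgStatHashKey    key;
        PgStatSharedRef *shared_ref;
    } PendingRelease;

    List       *pending = NIL;
    ListCell   *lc;

    /* inside the dshash_seq loop: remember the pair instead of releasing */
    PendingRelease *pr = palloc(sizeof(PendingRelease));
    pr->key = key;
    pr->shared_ref = shared_ref;
    pending = lappend(pending, pr);

    /* after dshash_seq_term(), with no dshash partition lock held */
    foreach(lc, pending)
    {
        PendingRelease *p = lfirst(lc);

        pgstat_shared_ref_release(p->key, p->shared_ref);
    }
    list_free_deep(pending);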

> * Database stats contain other stats. Drop those as well when
> * dropping the database. AFIXME: Perhaps this should be done in a
> * slightly more principled way?
> */
> if (key.kind == PGSTAT_KIND_DB)
> pgstat_drop_database_and_contents(key.dboid);

I tend to agree with that, and one possibility is to have a
PgStatKindInfo.drop_cascade_cb(PgStatShm_StatEntryHeader *header), along
the lines of the sketch below. But it is really needed only by
PGSTAT_KIND_DB..
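
Roughly (just a sketch of the idea; pgstat_kind_info_for() and the
callback member are hypothetical here, and "header" would be the shared
entry being dropped):

    /* generic drop path, replacing the hard-coded PGSTAT_KIND_DB check */
    const PgStatKindInfo *kind_info = pgstat_kind_info_for(key.kind);

    if (kind_info->drop_cascade_cb)
        kind_info->drop_cascade_cb(header);  /* set only for PGSTAT_KIND_DB */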

> * AFIXME: consistent naming
> * AFIXME: deduplicate some of this code with pgstat_fetch_snapshot_build().
> *
> * AFIXME: it'd be nicer if we passed .snapshot_cb() the target memory
> * location, instead of putting PgStatSnapshot into pgstat_internal.h
> */
> void
> pgstat_snapshot_global(PgStatKind kind)

Would adding PGSTAT_KIND_NONE to PgStatKind, or an InvalidPgStatKind, work
for the deduplication? But I'm afraid that would hurt in some other way.

For the memory location, it seems like a matter of taste, but if we don't
need multiple copies of a global snapshot, I don't think .snapshot_cb()
needs to take the target memory location.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
