Re: shared-memory based stats collector

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: andres(at)anarazel(dot)de
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, gkokolatos(at)protonmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2021-04-02 06:34:54
Message-ID: 20210402.153454.1756689956911228244.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Thu, 1 Apr 2021 19:44:25 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in
> Hi,
>
> I spent quite a bit more time working on the patch. There's are large
> changes:
>
> - postmaster/pgstats.c (which is an incorrect location now that it's not
> a subprocess anymore) is split into utils/activity/pgstat.c and
> utils/activity/pgstat_kind.c. I don't love the _kind name, but I
> couldn't come up with anything better.

The place was not changed to keep footprint smaller. I agree that the
old place is not appropriate. pgstat_kind... How about changin
pgstat.c to pgstat_core.c and pgstat_kind.c to pgstat.c?

> - Implemented a new GUC, stats_fetch_consistency = {none, cache,
> snapshot}. I think the code overhead of it is pretty ok - most of the
> handling is entirely generalized.

Sounds good.

> - Most of the "per stats kind" handling is done in pgstat_kind.c. Nearly
> all the rest is done through an array with per-stats-kind information
> (extending what was already done with pgstat_sharedentsize etc).
>
> - There is no separate "pending stats" hash anymore. If there are
> pending stats, they are referenced from 'pgStatSharedRefHash' (which
> used to be the "lookup cache" hash). All the entries with pending
> stats are in double linked list pgStatPending.

Sounds reasonable. A bit silimar to TabStatusArray.. Pending stats and
shared stats share the same key so they are naturally consolidatable.

> - A stat's entry's lwlock, refcount, .. are moved into the dshash
> entry. There is no need for them to be separate anymore. Also allows
> to avoid doing some dsa lookups while holding dshash locks.
>
> - The dshash entries are not deleted until the refcount has reached
> 0. That's an important building block to avoid constantly re-creating
> stats when flushing pending stats for a dropped object.

Does that mean the entries for a dropped object is actually dropped by
the backend that has been flushd stats of the dropped object at exit?
Sounds nice.

> - The reference to the shared entry is established the first time stats
> for an object are reported. Together with the previous entry that
> avoids nearly all the avenues for re-creating already dropped stats
> (see below for the hole).
>
> - I addded a bunch of pg_regress style tests, and a larger amount of
> isolationtester tests. The latter are possibly due to a new
> pg_stat_force_next_flush() function, avoiding the need to wait until
> stats are submitted.
>
> - 2PC support for "precise" dropping of stats has been added, the
> collect_oids() based approach removed.

Cool!

> - lots of bugfixes, comments, etc...

Thanks for all of them.

> I know of one nontrivial issue that can lead to dropped stats being
> revived:
>
> Within a transaction, a functions can be called even when another
> transaction that dropped that function has already committed. I added a
> spec test reproducing the issue:
>
> # FIXME: this shows the bug that stats will be revived, because the
> # shared stats in s2 is only referenced *after* the DROP FUNCTION
> # committed. That's only possible because there is no locking (and
> # thus no stats invalidation) around function calls.
> permutation
> "s1_track_funcs_all" "s2_track_funcs_none"
> "s1_func_call" "s2_begin" "s2_func_call" "s1_func_drop" "s2_track_funcs_all" "s2_func_call" "s2_commit" "s2_ff" "s1_func_stats" "s2_func_stats"
>
> I think the best fix here would be to associate an xid with the dropped
> stats object, and only delete the dshash entry once there's no alive
> with a horizon from before that xid...

I'm not sure how we do that avoiding a full scan on dshash..

> There's also a second issue (stats for newly created objects surviving
> the transaction), but that's pretty simple to resolve.
>
> Here's all the gory details of my changes happening incrementally:
>
> https://github.com/anarazel/postgres/compare/master...shmstat
>
> I'll squash and split tomorrow. Too tired for today.

Thank you very much for all of your immense effort.

> I think this is starting to look a lot better than what we have now. But
> I'm getting less confident that it's realistic to get any of this into
> PG14, given the state of the release cycle.
>
>
>
> > I'm impressed that the way you resolved "who should load stats". Using
> > static shared memory area to hold the point to existing DSA memory
> > resolves the "first attacher problem". However somewhat doubtful
> > about the "who should write the stats file", I think it is reasonable
> > in general.
> >
> > But the current place of calling pgstat_write_stats() is a bit to
> > early. Checkpointer reports some stats *after* calling
> > ShutdownXLOG(). Perhaps we need to move it after pg_stat_report_*()
> > calls in HandleCheckpointerInterrupts().
>
> I now moved it into a before_shmem_exit(). I think that should avoid
> that problem?

I think so.

> > https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37
> >
> > + * XXX: Why is this a good place to do this?
> >
> > Agreed. We don't need to so haste to clean up stats entries. We could
> > run that in pgstat_reporT_stat()?
>
> I've not changed that yet, but I had the same thought.
>
>
> > Agreed that it's better to move database stat entries to fixed
> > pointers.
>
> I actually ended up reverting that. My main motivation for it was that
> it was problematic that new pending database stats entries could be
> created at some random place in the hashtable. But with the linked list
> of pending entries that's not a problem anymore. And I found it
> nontrivial to manage the refcounts to the shared entry accurately this
> way.
>
> We could still add a cache for the two stats entries though...

Yeah.

> > > - consider removing PgStatTypes and replacing it with the oid of the
> > > table the type of stats reside in. So PGSTAT_TYPE_DB would be
> > > DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...
> > >
> > > I think that'd make the system more cleanly extensible going forward?
> >
> > I'm not sure that works as expected. We already separated repliation
> > stats from the unified stats hash and pgstat_read/write_statsfile()
> > needs have the corresponding specific code path.
>
> I didn't quite go towards my proposal, but I think I got a lot closer
> towards not needing much extra code for additional types of stats. I
> even added an XXX to pgstat_read/write_statsfile() that show how they
> now could be made generic.

I'll check it.

> > > - the replication slot stuff isn't quite right in my branch
> >
> > Ah, yeah. As I mentioned above I think it should be in the unified
> > stats and should have a special means of shotcut. And the global
> > stats also should be the same.
>
> The problem is that I use indexes for addressing, but that they can
> change between restarts. I think we can fix that fairly easily, by
> mapping names to indices once, pgstat_restore_stats(). At the point we
> call pgstat_restore_stats() StartupReplicationSlots() already was
> executed, so we can just inquire at that point...

Does that mean the saved replslot stats is keyed by their names?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2021-04-02 07:28:45 Re: Fix typo in verify_heapam.c
Previous Message Masahiko Sawada 2021-04-02 06:02:33 Fix typo in verify_heapam.c