Re: Higher level questions around shared memory stats

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Higher level questions around shared memory stats
Date: 2022-03-30 21:08:41
Message-ID: 20220330210841.2dymkb2ydbdonp5f@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-03-30 14:42:23 -0400, Robert Haas wrote:
> On Tue, Mar 29, 2022 at 5:01 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I think it's reasonably rare because in cases there'd be corruption, we'd
> > typically not even have written them out / throw them away explicitly - we
> > only read stats when starting without crash recovery.
> >
> > So the "expected" case of corruption afaicts solely is a OS crash just after
> > the shutdown checkpoint completed?
>
> Can we prevent that case from occurring, so that there are no expected cases?

We likely can, at least for the causes of corruption I know of. We already
write the statsfile into a temporary filename and then rename into place. I
think all we'd need to do is to use durable_rename() to make sure it's durable
once renamed into place.

It's really unrelated to the shared memory stats patch though, so I'd prefer
not to tie it to that.

> > I can think of these different times:
> >
> > - Last time stats were removed due to starting up in crash recovery
> > - Last time stats were created from scratch, because no stats data file was
> > present at startup
> > - Last time stats were thrown away due to corruption
> > - Last time a subset of stats were reset using one of the pg_reset* functions
> >
> > Makes sense?
>
> Yes. Possibly that last could be broken in to two: when all stats were
> last reset, when some stats were last reset.

Believe it or not, we don't currently have a function to reset all stats. We
should definitely add that though, because the invocation to reset all stats
gets more ridiculous^Wcomplicated with each release.

I think the minimal invocation currently is something like:

-- reset all stats shared between databases
SELECT pg_stat_reset_shared('archiver');
SELECT pg_stat_reset_shared('bgwriter');
SELECT pg_stat_reset_shared('wal');
SELECT pg_stat_reset_replication_slot(NULL);
SELECT pg_stat_reset_slru(NULL);
SELECT pg_stat_reset_subscription_stats(NULL);

-- connect to each database and reset the stats in that database
SELECT pg_stat_reset();

I've protested against replication slot, slru, subscription stats not being
resettable via pg_stat_reset_shared(), nobody else seemed to care.

> > > Does redo update the stats?
> >
> > With "update" do you mean generate new stats? In the shared memory stats patch
> > it triggers stats to be dropped, on HEAD it just resets all stats at startup.
> >
> > Redo itself doesn't generate stats, but bgwriter, checkpointer, backends do.
>
> Well, I guess what I'm trying to figure out is what happens if we run
> in recovery for a long time -- say, a year -- and then get promoted.
> Do we have reasons to expect that the stats will be accurate enough to
> use at that point, or will they be way off?

What do you mean with 'accurate enough'?

With or without shared memory stats pg_stat_all_tables.{n_mod_since_analyze,
n_ins_since_vacuum, n_live_tup, n_dead_tup ...} will be be zero. The replay
process doesn't update them.

In contrast to that, things like pg_stat_all_tables.{seq_scan, seq_tup_read,
idx_tup_fetch, ...} will be accurate, with one exception below.

pg_stat_bgwriter, pg_stat_wal, etc will always be accurate.

On HEAD, there may be a lot of dead stats for dropped databases / tables /
functions that have been dropped since the start of the cluster. They will
eventually get removed, once autovacuum starts running in the respective
database (i.e. pgstat_vacuum_stat() gets run).

The exception noted above is that because pg_stat_all_tables contents are
never removed during recovery, it becomes a lot more plausible for oid
conflicts to occur. So the stats for two different tables might get added up
accidentally - but that'll just affect the non-zero columns, of course.

With the shared memory stats patch, stats for dropped objects (i.e. databases,
tables, ... ) are removed shortly after they have been dropped, so that
conflict risk doesn't exist anymore.

So I don't think increasing inaccuracy is a reason to throw away stats on
replica startup. Particularly because we already don't throw them away when
promoting the replica, just when having started it last.

> I don't have a great understanding of how this all works, but if
> running recovery for a long time is going to lead to a situation where
> the stats progressively diverge from reality, then preserving them
> doesn't seem as valuable as if they're going to be more or less
> accurate.

Minus the oid wraparound risk on HEAD, the only way they increasingly diverge
is that the '0' in a bunch of pg_stat_all_tables columns might get less and
less accurate. But that's not the type of divergence you're talking about, I
think.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2022-03-30 21:22:58 Re: [Proposal] vacuumdb --schema only
Previous Message David G. Johnston 2022-03-30 21:05:01 Re: pg_stat_reset_single_*_counters vs pg_stat_database.stats_reset