Re: Tracking last scan time

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Tracking last scan time
Date: 2022-08-31 17:52:49
Message-ID: CAEze2WiSvToc1bz2OGz+614COShD5H1Z+NLDSwORfT3p975qgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 31 Aug 2022 at 18:21, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2022-08-23 10:55:09 +0100, Dave Page wrote:
> > Often it is beneficial to review one's schema with a view to removing
> > indexes (and sometimes tables) that are no longer required. It's very
> > difficult to understand when that is the case by looking at the number of
> > scans of a relation as, for example, an index may be used infrequently but
> > may be critical in those times when it is used.
> >
> > The attached patch against HEAD adds optional tracking of the last scan
> > time for relations. It updates pg_stat_*_tables with new last_seq_scan and
> > last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to
> > help with this.
> >
> > Due to the use of gettimeofday(), those values are only maintained if a new
> > GUC, track_scans, is set to on. By default, it is off.
> >
> > I did run a 12 hour test to see what the performance impact is. pgbench was
> > run with scale factor 10000 and 75 users across 4 identical bare metal
> > machines running Rocky 8 in parallel which showed roughly a -2% average
> > performance penalty against HEAD with track_scans enabled. Machines were
> > PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data
> > directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source
> > is tsc.
> >
> > HEAD track_scans Penalty (%)
> > box1 19582.49735 19341.8881 -1.22869541
> > box2 19936.55513 19928.07479 -0.04253664659
> > box3 19631.78895 18649.64379 -5.002830696
> > box4 19810.86767 19420.67192 -1.969604525
> > Average 19740.42728 19335.06965 -2.05343896
>
> Based on the size of those numbers this was a r/w pgbench. If it has this
> noticable an impact for r/w, with a pretty low number of scans/sec, how's the
> overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It
> must be quite bad.
>
> I don't think we should accept this feature with this overhead - but I also
> think we can do better, by accepting a bit less accuracy. For this to be
> useful we don't need a perfectly accurate timestamp. The statement start time
> is probably not accurate enough, but we could just have bgwriter or such
> update one in shared memory every time we wake up? Or perhaps we could go to
> an even lower granularity, by putting in the current LSN or such?

I don't think that LSN is precise enough. For example, if you're in a
(mostly) read-only system, the system may go long times without any
meaningful records being written.

As for having a lower granularity and preventing the
one-syscall-per-Relation issue, can't we reuse the query_start or
state_change timestamps that appear in pg_stat_activity (potentially
updated immediately before this stat flush), or some other per-backend
timestamp that is already maintained and considered accurate enough
for this use?
Regardless, with this patch as it is we get a new timestamp for each
relation processed, which I think is a waste of time (heh) even in
VDSO-enabled systems.

Apart from the above, I don't have any other meaningful opinion on
this patch - it might be a good addition, but I don't consume stats
often enough to make a good cost / benefit comparison.

Kind regards,

Matthias van de Meent

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeremy Schneider 2022-08-31 17:58:14 Re: [PATCH] Query Jumbling for CALL and SET utility statements
Previous Message Michael Banck 2022-08-31 17:51:33 Re: PostgreSQL 15 release announcement draft