Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date: 2023-01-17 17:22:14
Message-ID: CAAKRu_bCU3rqyjvWYszSHMvQRRYtJjmzV49MJqsX6wY9ftoazA@mail.gmail.com
Lists: pgsql-hackers

v48 attached.

On Fri, Jan 13, 2023 at 6:36 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2023-01-13 13:38:15 -0500, Melanie Plageman wrote:
> > From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres(at)anarazel(dot)de>
> > Date: Fri, 9 Dec 2022 18:23:19 -0800
> > Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations
> > diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
> > index 0fa5370bcd..608c3b59da 100644
> > --- a/src/backend/utils/activity/pgstat.c
> > +++ b/src/backend/utils/activity/pgstat.c
>
> Reminder to self: Need to bump PGSTAT_FILE_FORMAT_ID before commit.
>
> Perhaps you could add a note about that to the commit message?
>

done

>
>
> > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
> > .snapshot_cb = pgstat_checkpointer_snapshot_cb,
> > },
> >
> > + [PGSTAT_KIND_IO] = {
> > + .name = "io_ops",
>
> That should be "io" now I think?
>

Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
to the IO stats tests now.
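
For reference, the added check is along these lines (a sketch; the
exact wording in the test file may differ):

-- IO stats are fixed-numbered, so both the database and object OID are 0
SELECT pg_stat_have_stats('io', 0, 0);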

It would be nice if pgstat_get_kind_from_str() could be used in
pg_stat_reset_shared() to avoid having to remember to change both
places. It doesn't really work, though, because when the user input is
wrong we want to throw the error message from pg_stat_reset_shared(),
not the one from pgstat_get_kind_from_str().
Also:
- Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
well into this paradigm
- Only a subset of the statistics kinds are reset through this function
- bgwriter and checkpointer share a reset target
I added a comment -- perhaps that's all I can do?
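
To illustrate, the tempting version would look something like this
(purely a sketch, not anything in the attached patches), and it runs
straight into the issues above:

Datum
pg_stat_reset_shared(PG_FUNCTION_ARGS)
{
	char	   *target = text_to_cstring(PG_GETARG_TEXT_PP(0));

	/*
	 * Problems: a bad target is reported as an "invalid statistics kind"
	 * rather than an unrecognized reset target; recovery_prefetch has no
	 * statistics kind at all; only a subset of kinds are valid reset
	 * targets; and bgwriter/checkpointer share a single target.
	 */
	pgstat_reset_of_kind(pgstat_get_kind_from_str(target));

	PG_RETURN_VOID();
}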

On a separate note, should we be setting have_[io/slru/etc]stats to
false in the reset all functions?

>
> > +/*
> > + * Check that stats have not been counted for any combination of IOContext,
> > + * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
> > + * passed-in PgStat_BackendIO must contain stats from the BackendType specified
> > + * by the second parameter. Caller is responsible for locking the passed-in
> > + * PgStat_BackendIO, if needed.
> > + */
>
> Other PgStat_Backend* structs are just for pending data. Perhaps we could
> rename it slightly to make that clearer? PgStat_BktypeIO?
> PgStat_IOForBackendType? or a similar variation?

I've done this.

>
> > +bool
> > +pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
> > + BackendType bktype)
> > +{
> > + bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
> > +
> > + for (IOContext io_context = IOCONTEXT_FIRST;
> > + io_context < IOCONTEXT_NUM_TYPES; io_context++)
> > + {
> > + for (IOObject io_object = IOOBJECT_FIRST;
> > + io_object < IOOBJECT_NUM_TYPES; io_object++)
> > + {
> > + /*
> > + * Don't bother trying to skip to the next loop iteration if
> > + * pgstat_tracks_io_object() would return false here. We still
> > + * need to validate that each counter is zero anyway.
> > + */
> > + for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
> > + {
> > + if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
> > + backend_io->data[io_context][io_object][io_op] != 0)
> > + return false;
>
> Hm, perhaps this could be broken up into multiple lines? Something like
>
> /* no stats, so nothing to validate */
> if (backend_io->data[io_context][io_object][io_op] == 0)
> continue;
>
> /* something went wrong if have stats for something not tracked */
> if (!bktype_tracked ||
> !pgstat_tracks_io_op(bktype, io_context, io_object, io_op))
> return false;

I've done this.

> > +typedef struct PgStat_BackendIO
> > +{
> > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > +} PgStat_BackendIO;
>
> Would it bother you if we swapped the order of iocontext and iobject here and
> related places? It makes more sense to me semantically, and should now be
> pretty easy, code wise.

So, thinking about this, I started noticing inconsistencies in this
ordering in other areas as well: for example, the order in which these
objects are mentioned in commit messages and comments, and the order of
parameters (pgstat_count_io_op() currently takes them in reverse
order).

I think we should make a final decision about this ordering and then
make everything consistent (including the ordering in the view).

Currently the order is:
BackendType
IOContext
IOObject
IOOp

You are suggesting this order:
BackendType
IOObject
IOContext
IOOp
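
Concretely (a sketch based on the PgStat_BackendIO struct quoted above;
BackendType stays the index into the containing array in both cases):

/* current: context-major */
PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];

/* suggested: object-major */
PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];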

Could you explain what you find more natural about this ordering (as I
find the other more natural)?

This is one possible natural sentence with these objects:

During COPY, a client backend may read in data from a permanent
relation.
This order is:
IOContext
BackendType
IOOp
IOObject

I think English sentences are often structured subject, verb, object --
but in our case, we have an extra thing that doesn't fit neatly
(IOContext). Also, IOOp in a sentence would be in the middle (as the
verb). I made it last because a) it feels like the smallest unit, and
b) it would make the code a lot more annoying if it weren't last.

WRT IOObject and IOContext, is there a future case for which having
IOObject first will be better or lead to fewer mistakes?

I actually see loads of places where this needs to be made consistent.

>
> > +/* shared version of PgStat_IO */
> > +typedef struct PgStatShared_IO
> > +{
>
> Maybe /* PgStat_IO in shared memory */?
>

updated.

>
> > Subject: [PATCH v47 3/5] pgstat: Count IO for relations
>
> Nearly happy with this now. See one minor nit below.
>
> I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
> I don't have a better idea, and it doesn't seem too horrible.

You don't like it because such things shouldn't be in md.c -- since we
went to the trouble of having function pointers and making it general?

>
> > @@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >
> > UnlockBufHdr(buf, buf_state);
> >
> > + if (oldFlags & BM_VALID)
> > + {
> > + /*
> > + * When a BufferAccessStrategy is in use, blocks evicted from shared
> > + * buffers are counted as IOOP_EVICT in the corresponding context
> > + * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
> > + * strategy in two cases: 1) while initially claiming buffers for the
> > + * strategy ring 2) to replace an existing strategy ring buffer
> > + * because it is pinned or in use and cannot be reused.
> > + *
> > + * Blocks evicted from buffers already in the strategy ring are
> > + * counted as IOOP_REUSE in the corresponding strategy context.
> > + *
> > + * At this point, we can accurately count evictions and reuses,
> > + * because we have successfully claimed the valid buffer. Previously,
> > + * we may have been forced to release the buffer due to concurrent
> > + * pinners or erroring out.
> > + */
> > + pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
> > + IOOBJECT_RELATION, *io_context);
> > + }
> > +
> > if (oldPartitionLock != NULL)
> > {
> > BufTableDelete(&oldTag, oldHash);
>
> There's no reason to do this while we still hold the buffer partition lock,
> right? That's a highly contended lock, and we can just move the counting a few
> lines down.

Thanks, I've done this.

>
> > @@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
> > if (need_to_close)
> > FileClose(file);
> >
> > + if (result >= 0)
> > + pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> > +
>
> I'd lean towards doing this unconditionally, it's still an fsync if it
> failed... Not that it matters.

Good point. We still incurred the cost even if we didn't benefit from
the effect. I've updated this.
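
So that hunk simply loses its condition, roughly:

	if (need_to_close)
		FileClose(file);

	/* count the fsync even if it failed; the cost was still paid */
	pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);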

>
> > Subject: [PATCH v47 4/5] Add system view tracking IO ops per backend type
>
> Note to self + commit message: Remember the need to do a catversion bump.

Noted.

>
> > +-- pg_stat_io test:
> > +-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy.
>
> Maybe add that "whereas a sequential scan does not, see ..."?

Updated.

>
> > This allows
> > +-- us to reliably test that pg_stat_io BULKREAD reads are being captured
> > +-- without relying on the size of shared buffers or on an expensive operation
> > +-- like CREATE DATABASE.
>
> CREATE / DROP TABLESPACE is also pretty expensive, but I don't have a better
> idea.

I've added a comment.

>
> > +-- Create an alternative tablespace and move the heaptest table to it, causing
> > +-- it to be rewritten.
>
> IIRC the point of that is that it reliably evicts all the buffers from s_b,
> correct? If so, mention that?

Done.

>
> > +Datum
> > +pg_stat_get_io(PG_FUNCTION_ARGS)
> > +{
> > + ReturnSetInfo *rsinfo;
> > + PgStat_IO *backends_io_stats;
> > + Datum reset_time;
> > +
> > + InitMaterializedSRF(fcinfo, 0);
> > + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> > +
> > + backends_io_stats = pgstat_fetch_stat_io();
> > +
> > + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> > +
> > + for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
> > + {
> > + bool bktype_tracked;
> > + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
> > + PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
> > +
> > + /*
> > + * For those BackendTypes without IO Operation stats, skip
> > + * representing them in the view altogether. We still loop through
> > + * their counters so that we can assert that all values are zero.
> > + */
> > + bktype_tracked = pgstat_tracks_io_bktype(bktype);
>
> How about instead just doing Assert(pgstat_bktype_io_stats_valid(...))? That
> deduplicates the logic for the asserts, and avoids doing the full loop when
> assertions aren't enabled anyway?
>

I've done this and added a comment.
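
Roughly, each loop iteration now starts like this (a sketch reusing the
names from the quoted hunk; the comments in the actual patch differ):

		/*
		 * With asserts enabled, validate all of this backend type's
		 * counters, including those of backend types that do no IO.
		 */
		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));

		/* skip backend types without IO stats in the view altogether */
		if (!pgstat_tracks_io_bktype(bktype))
			continue;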

>
>
> > +-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
> > +-- and fsyncs.
> > +-- The second checkpoint ensures that stats from the first checkpoint have been
> > +-- reported and protects against any potential races amongst the table
> > +-- creation, a possible timing-triggered checkpoint, and the explicit
> > +-- checkpoint in the test.
>
> There's a comment about the subsequent checkpoints earlier in the file, and I
> think the comment is slightly more precise. Mybe just reference the earlier comment?
>
>
> > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > +-- from it to cause it to be read back into shared buffers.
> > +SET allow_in_place_tablespaces = true;
> > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
>
> Perhaps worth doing this in tablespace.sql, to avoid the additional
> checkpoints done as part of CREATE/DROP TABLESPACE?
>
> Or, at least combine this with the CHECKPOINTs above?

I see that a checkpoint is requested when dropping the tablespace if
not all of the files in it are deleted. Dropping the temporary table
should not have an effect on this. It seems like, if the DROP TABLE for
the permanent table comes before the explicit checkpoints in the test,
then the DROP TABLESPACE will not cause an additional checkpoint. Is
this what you are suggesting?
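
That is, roughly this ordering (object names taken from the quoted
hunks; unrelated statements elided):

-- drop the permanent table before the explicit checkpoints
DROP TABLE test_io_shared;
CHECKPOINT;
CHECKPOINT;
...
-- the tablespace should be empty by now, so no extra checkpoint
DROP TABLESPACE regress_io_stats_tblspc;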

>
> > +-- Drop the table so we can drop the tablespace later.
> > +DROP TABLE test_io_shared;
> > +-- Test that the follow IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
> > +-- - eviction of local buffers in order to reuse them
> > +-- - reads of temporary table blocks into local buffers
> > +-- - writes of local buffers to permanent storage
> > +-- - extends of temporary tables
> > +-- Set temp_buffers to a low value so that we can trigger writes with fewer
> > +-- inserted tuples. Do so in a new session in case temporary tables have been
> > +-- accessed by previous tests in this session.
> > +\c
> > +SET temp_buffers TO '1MB';
>
> I'd set it to the actual minimum '100' (in pages). Perhaps that'd allow to
> make test_io_local a bit smaller?

I've done this.

>
> > +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
> > +SELECT sum(extends) AS io_sum_local_extends_before
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +SELECT sum(evictions) AS io_sum_local_evictions_before
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +SELECT sum(writes) AS io_sum_local_writes_before
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +-- Insert tuples into the temporary table, generating extends in the stats.
> > +-- Insert enough values that we need to reuse and write out dirty local
> > +-- buffers, generating evictions and writes.
> > +INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
> > +SELECT sum(reads) AS io_sum_local_reads_before
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
>
> Maybe add something like
>
> SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
>
> Better toast compression or such could easily make test_io_local smaller than
> it's today. Seeing that it's too small would make it easier to understand the
> failure.

Good idea. I used pg_table_size() because it seems like
pg_relation_size() does not include the TOAST relation. However, I'm not
sure that is a good idea either, because pg_table_size() also includes
the FSM and visibility map. Should I instead write a query that looks up
the TOAST relation and adds its pg_relation_size() to that of the main
relation?
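
Something along these lines, maybe (just a sketch, not what is in the
attached patches):

-- main relation plus its TOAST relation (if any), in blocks
SELECT (pg_relation_size('test_io_local') +
        coalesce(pg_relation_size(nullif(reltoastrelid, 0)), 0)) /
       current_setting('block_size')::int8 > 100
  FROM pg_class
  WHERE oid = 'test_io_local'::regclass;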

>
> > +SELECT sum(evictions) AS io_sum_local_evictions_after
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +SELECT sum(reads) AS io_sum_local_reads_after
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +SELECT sum(writes) AS io_sum_local_writes_after
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +SELECT sum(extends) AS io_sum_local_extends_after
> > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
>
> This could just be one select with multiple columns?
>
> I think if you use something like \gset io_sum_local_after_ you can also avoid
> the need to repeat "io_sum_local_" so many times.

Thanks, I didn't realize that was possible. I've fixed this throughout
the test file.
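
For reference, each "before"/"after" group now collapses to a single
statement along these lines (the prefix name here is just
illustrative):

SELECT sum(evictions) AS evictions,
       sum(reads) AS reads,
       sum(writes) AS writes,
       sum(extends) AS extends
  FROM pg_stat_io
  WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_

The values then come back as :io_sum_local_after_evictions and so on.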

On Mon, Jan 16, 2023 at 4:42 PM Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com> wrote:
> I missed a couple of versions, but I think the docs are clearer now.
> I'm torn on losing some of the detail, but overall I do think it's a
> good trade-off. Moving some details out to after the table does keep
> the bulk of the view documentation more readable, and the "inform
> database tuning" part is great. I really like the idea of a separate
> Interpreting Statistics section, but for now this works.
>
> >+ <literal>vacuum</literal>: I/O operations performed outside of shared
> >+ buffers while vacuuming and analyzing permanent relations.
>
> Why only permanent relations? Are temporary relations treated
> differently? I imagine if someone has a temp-table-heavy workload that
> requires regularly vacuuming and analyzing those relations, this point
> may be confusing without some additional explanation.

Ah, yes. This is a bit confusing. We don't use buffer access strategies
when operating on temp relations, so vacuuming them is counted in the
normal IO context. I've added this information to the docs, but now that
definition is a bit long. Perhaps it should be a note? That seems like
it would draw too much attention to this detail, though...

- Melanie

Attachment Content-Type Size
v48-0002-pgstat-Count-IO-for-relations.patch text/x-patch 22.2 KB
v48-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch text/x-patch 32.6 KB
v48-0004-pg_stat_io-documentation.patch text/x-patch 13.5 KB
v48-0001-pgstat-Infrastructure-to-track-IO-operations.patch text/x-patch 28.6 KB
