Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date: 2023-01-17 19:12:32
Message-ID: 20230117191232.mklhdx4pyavwxmpf@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
> > > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
> > > .snapshot_cb = pgstat_checkpointer_snapshot_cb,
> > > },
> > >
> > > + [PGSTAT_KIND_IO] = {
> > > + .name = "io_ops",
> >
> > That should be "io" now I think?
> >
>
> Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
> to the IO stats tests now.
>
> It would be nice if pgstat_get_kind_from_str() could be used in
> pg_stat_reset_shared() to avoid having to remember to change both.

It's hard to make that work, because of the historical behaviour of that
function :(

> Also:
> - Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
> well into this paradigm

I think that needs a rework anyway - it went in at about the same time as the
shared mem stats patch, so it doesn't quite cohere.

> On a separate note, should we be setting have_[io/slru/etc]stats to
> false in the reset all functions?

That'd not work reliably, because other backends won't do the same. I don't
see a benefit in doing it differently in the local connection than the other
connections.

> > > +typedef struct PgStat_BackendIO
> > > +{
> > > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +} PgStat_BackendIO;
> >
> > Would it bother you if we swapped the order of iocontext and iobject here and
> > related places? It makes more sense to me semantically, and should now be
> > pretty easy, code wise.
>
> So, thinking about this I started noticing inconsistencies in other
> areas around this order:
> For example: ordering of objects mentioned in commit messages and comments,
> ordering of parameters (like in pgstat_count_io_op() [currently in
> reverse order]).
>
> I think we should make a final decision about this ordering and then
> make everywhere consistent (including ordering in the view).
>
> Currently the order is:
> BackendType
> IOContext
> IOObject
> IOOp
>
> You are suggesting this order:
> BackendType
> IOObject
> IOContext
> IOOp
>
> Could you explain what you find more natural about this ordering (as I
> find the other more natural)?

The object we're performing IO on determines more things than the context. So
it just seems like the natural hierarchical fit. The context is a sub-category
of the object. Consider how it'll look if we also have objects for 'wal',
'temp files'. It'll make sense to group by just the object, but it won't make
sense to group by just the context.

If it were trivial to do I'd use a different IOContext for each IOObject. But
it'd make it much harder. So there'll just be a bunch of values of IOContext
that'll only be used for one or a subset of the IOObjects.

The reason to put BackendType at the top is pragmatic - one backend is of a
single type, but can do IO for all kinds of objects/contexts. So any other
hierarchy would make the locking etc much harder.

> This is one possible natural sentence with these objects:
>
> During COPY, a client backend may read in data from a permanent
> relation.
> This order is:
> IOContext
> BackendType
> IOOp
> IOObject
>
> I think English sentences are often structured subject, verb, object --
> but in our case, we have an extra thing that doesn't fit neatly
> (IOContext).

"..., to avoid polluting the buffer cache it uses the bulk (read|write)
strategy".

> Also, IOOp in a sentence would be in the middle (as the
> verb). I made it last because a) it feels like the smallest unit b) it
> would make the code a lot more annoying if it wasn't last.

Yea, I think pragmatically that is the right choice.

> > > Subject: [PATCH v47 3/5] pgstat: Count IO for relations
> >
> > Nearly happy with this now. See one minor nit below.
> >
> > I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
> > I don't have a better idea, and it doesn't seem too horrible.
>
> You don't like it because such things shouldn't be in md.c -- since we
> went to the trouble of having function pointers and making it general?

It's more of a gut feeling than well reasoned ;)

> > > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > > +-- from it to cause it to be read back into shared buffers.
> > > +SET allow_in_place_tablespaces = true;
> > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> >
> > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > checkpoints done as part of CREATE/DROP TABLESPACE?
> >
> > Or, at least combine this with the CHECKPOINTs above?
>
> I see a checkpoint is requested when dropping the tablespace if not all
> the files in it are deleted. It seems like if the DROP TABLE for the
> permanent table is before the explicit checkpoints in the test, then the
> DROP TABLESPACE will not cause an additional checkpoint.

Unfortunately, that's not how it works :(. See the comment above mdunlink():

> * For regular relations, we don't unlink the first segment file of the rel,
> * but just truncate it to zero length, and record a request to unlink it after
> * the next checkpoint. Additional segments can be unlinked immediately,
> * however. Leaving the empty file in place prevents that relfilenumber
> * from being reused. The scenario this protects us from is:
> ...

> Is this what you are suggesting? Dropping the temporary table should not
> have an effect on this.

I was wondering about simply moving that portion of the test to
tablespace.sql, where we already created a tablespace.

An alternative would be to propose splitting tablespace.sql into one portion
running at the start of parallel_schedule, and one at the end. Historically,
we needed tablespace.sql to be optional due to causing problems when
replicating to another instance on the same machine, but now we have
allow_in_place_tablespaces.

> > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
> >
> > Better toast compression or such could easily make test_io_local smaller than
> > it is today. Seeing that it's too small would make it easier to understand the
> > failure.
>
> Good idea. So, I used pg_table_size() because it seems like
> pg_relation_size() does not include the toast relations. However, I'm
> not sure this is a good idea, because pg_table_size() includes FSM and
> visibility map. Should I write a query to get the toast relation name
> and add pg_relation_size() of that relation and the main relation?

I think it's the right thing to just include the relation size. Your queries
IIRC won't use the toast table or other forks. So I'd leave it at just
pg_relation_size().

Greetings,

Andres Freund
