Quick Links

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc:	Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject:	Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date:	2022-11-21 00:38:15
Message-ID:	20221121003815.qnwlnz2lhkow2e5w@awork3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

One good follow up patch will be to rip out the accounting for
pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps
buffers_alloc and replace it with a subselect getting the equivalent data from
pg_stat_io. It might not be quite worth doing for buffers_alloc because of
the way that's tied into bgwriter pacing.

On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote:
> > + again. A high number of repossessions is a sign of contention for the +
> > blocks operated on by the strategy operation.
> >
> > This (and in general the repossession description) makes sense, but
> > I'm not sure what to do with the information. Maybe Andres is right
> > that we could skip this in the first version?
>
> I've removed repossessed and rejected in attached v37. I am a bit sad
> about this because I don't see a good way forward and I think those
> could be useful for users.

Let's get the basic patch in and then check whether we can find a way to have
something providing at least some more information like repossessed and
rejected. I think it'll be easier to analyze in isolation.

> I have added the new column Andres recommended in [1] ("io_object") to
> clarify temp and local buffers and pave the way for bypass IO (IO not
> done through a buffer pool), which can be done on temp or permanent
> files for temp or permanent relations, and spill file IO which is done
> on temporary files but isn't related to temporary tables.

> IOObject has increased the memory footprint and complexity of the code
> around tracking and accumulating the statistics, though it has not
> increased the number of rows in the view.

It doesn't look too bad from here. Is there a specific portion of the code
where it concerns you the most?

> One question I still have about this additional dimension is how much
> enumeration we need of the various combinations of IO operations, IO
> objects, IO ops, and backend types which are allowed and not allowed.
>
> Currently because it is only valid to operate on both IOOBJECT_RELATION
> and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
> various functions asserting and validating what is "allowed" in terms of
> combinations of ops, objects, contexts, and backend types aren't much
> different than they were without IO Object. However, once we begin
> adding other objects and contexts, we will need to make this logic more
> comprehensive. I'm not sure whether or not I should do that
> preemptively.

I'd not do it preemptively.

> @@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
> isExtend = (blockNum == P_NEW);
>
> + if (isLocalBuf)
> + {
> + /*
> + * Though a strategy object may be passed in, no strategy is employed
> + * when using local buffers. This could happen when doing, for example,
> + * CREATE TEMPORRARY TABLE AS ...
> + */
> + io_context = IOCONTEXT_BUFFER_POOL;
> + io_object = IOOBJECT_TEMP_RELATION;
> + }
> + else
> + {
> + io_context = IOContextForStrategy(strategy);
> + io_object = IOOBJECT_RELATION;
> + }

I think given how frequently ReadBuffer_common() is called in some workloads,
it'd be good to make IOContextForStrategy inlinable. But I guess that's not
easily doable, because struct BufferAccessStrategyData is only defined in
freelist.c.

Could we defer this until later, given that we don't currently need this in
case of buffer hits afaict?

> @@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> BufferAccessStrategy strategy,
> bool *foundPtr)
> {
> + bool from_ring;
> + IOContext io_context;
> BufferTag newTag; /* identity of requested block */
> uint32 newHash; /* hash value for newTag */
> LWLock *newPartitionLock; /* buffer partition lock for it */
> @@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> */
> LWLockRelease(newPartitionLock);
>
> + io_context = IOContextForStrategy(strategy);

Hm - doesn't this mean we do IOContextForStrategy() twice? Once in
ReadBuffer_common() and then again here?

> /* Loop here in case we have to try another victim buffer */
> for (;;)
> {
> +
> /*
> * Ensure, while the spinlock's not yet held, that there's a free
> * refcount entry.
> @@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> * Select a victim buffer. The buffer is returned with its header
> * spinlock still held!
> */
> - buf = StrategyGetBuffer(strategy, &buf_state);
> + buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
>
> Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
>

I think patch 0001 relies on this change already having been made, If I am not misunderstanding?

> @@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> }
> }
>
> + /*
> + * When a strategy is in use, only flushes of dirty buffers
> + * already in the strategy ring are counted as strategy writes
> + * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
> + * purpose of IO operation statistics tracking.
> + *
> + * If a shared buffer initially added to the ring must be
> + * flushed before being used, this is counted as an
> + * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
> + *
> + * If a shared buffer added to the ring later because the

Missing word?

> + * current strategy buffer is pinned or in use or because all
> + * strategy buffers were dirty and rejected (for BAS_BULKREAD
> + * operations only) requires flushing, this is counted as an
> + * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false).

I think this makes sense for now, but it'd be good if somebody else could
chime in on this...

> + *
> + * When a strategy is not in use, the write can only be a
> + * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL
> + * IOOP_WRITE).
> + */
> +
> /* OK, do the I/O */
> TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
> smgr->smgr_rlocator.locator.spcOid,
> smgr->smgr_rlocator.locator.dbOid,
> smgr->smgr_rlocator.locator.relNumber);
>
> - FlushBuffer(buf, NULL);
> + FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);
> LWLockRelease(BufferDescriptorGetContentLock(buf));
> ScheduleBufferTagForWriteback(&BackendWritebackContext,

> + if (oldFlags & BM_VALID)
> + {
> + /*
> + * When a BufferAccessStrategy is in use, evictions adding a
> + * shared buffer to the strategy ring are counted in the
> + * corresponding strategy's context.

Perhaps "adding a shared buffer to the ring are counted in the corresponding
context"? "strategy's context" sounds off to me.

> This includes the evictions
> + * done to add buffers to the ring initially as well as those
> + * done to add a new shared buffer to the ring when current
> + * buffer is pinned or otherwise in use.

I think this sentence could use a few commas, but not sure.

s/current/the current/?

> + * We wait until this point to count reuses and evictions in order to
> + * avoid incorrectly counting a buffer as reused or evicted when it was
> + * released because it was concurrently pinned or in use or counting it
> + * as reused when it was rejected or when we errored out.
> + */

I can't quite parse this sentence.

> + IOOp io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
> +
> + pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context);
> + }

I'd just inline the variable, but ...

> @@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> LocalRefCount[b]++;
> ResourceOwnerRememberBuffer(CurrentResourceOwner,
> BufferDescriptorGetBuffer(bufHdr));
> +
> break;
> }
> }

Spurious change.

> pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>
> *foundPtr = false;
> +
> return bufHdr;
> }

Dito.

> +/*
> +* IO Operation statistics are not collected for all BackendTypes.
> +*
> +* The following BackendTypes do not participate in the cumulative stats
> +* subsystem or do not do IO operations worth reporting statistics on:

s/worth reporting/we currently report/?

> + /*
> + * In core Postgres, only regular backends and WAL Sender processes
> + * executing queries will use local buffers and operate on temporary
> + * relations. Parallel workers will not use local buffers (see
> + * InitLocalBuffers()); however, extensions leveraging background workers
> + * have no such limitation, so track IO Operations on
> + * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
> + */
> + no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
> + == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
> + B_STANDALONE_BACKEND || bktype == B_STARTUP;
> +
> + if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object ==
> + IOOBJECT_TEMP_RELATION)
> + return false;

Personally I don't like line breaks on the == and would rather break earlier
on the && or ||.

> + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> + {
> + PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
> + PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
> +
> + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> + {

Is there any compiler that'd complain if you used IOContext/IOObject/IOOp as the
type in the for loop? I don't think so? Then you'd not need the casts in other
places, which I think would make the code easier to read.

> + PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
> + PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
> +
> + if (!expect_backend_stats ||
> + !pgstat_bktype_io_context_io_object_valid(MyBackendType,
> + (IOContext) io_context, (IOObject) io_object))
> + {
> + pgstat_io_context_ops_assert_zero(sharedent);
> + pgstat_io_context_ops_assert_zero(pendingent);
> + continue;
> + }
> +
> + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> + {
> + if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
> + (IOObject) io_object, (IOOp) io_op)))

Superfluous parens after the !, I think?

> void
> pgstat_report_vacuum(Oid tableoid, bool shared,
> @@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
> + * Operation stats, however this will not be called after an entire

Missing "until"?

> +static inline void
> +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
> +{

Does this need to be in pgstat.h? Perhaps pgstat_internal.h would suffice,
afaict it's not used outside of pgstat code?

> +
> +/*
> + * Assert that stats have not been counted for any combination of IOContext,
> + * IOObject, and IOOp which is not valid for the passed-in BackendType. The
> + * passed-in array of PgStat_IOOpCounters must contain stats from the
> + * BackendType specified by the second parameter. Caller is responsible for
> + * locking of the passed-in PgStatShared_IOContextOps, if needed.
> + */
> +static inline void
> +pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
> + BackendType bktype)
> +{

This doesn't look like it should be an inline function - it's quite long.

I think it's also too complicated for the compiler to optimize out if
assertions are disabled. So you'd need to handle this with an explicit #ifdef
USE_ASSERT_CHECKING.

> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>io_context</structfield> <type>text</type>
> + </para>
> + <para>
> + The context or location of an IO operation.
> + </para>
> + <itemizedlist>
> + <listitem>
> + <para>
> + <varname>io_context</varname> <literal>buffer pool</literal> refers to
> + IO operations on data in both the shared buffer pool and process-local
> + buffer pools used for temporary relation data.
> + </para>
> + <para>

The indentation in the sgml part of the patch seems to be a bit wonky.

> + <para>
> + These last three <varname>io_context</varname>s are counted separately
> + because the autovacuum daemon, explicit <command>VACUUM</command>,
> + explicit <command>ANALYZE</command>, many bulk reads, and many bulk
> + writes use a fixed amount of memory, acquiring the equivalent number of

s/memory/buffers/? The amount of memory isn't really fixed.

> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>read</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Reads by this <varname>backend_type</varname> into buffers in this
> + <varname>io_context</varname>.
> + <varname>read</varname> plus <varname>extended</varname> for
> + <varname>backend_type</varname>s
> +
> + <itemizedlist>
> +
> + <listitem>
> + <para>
> + <literal>autovacuum launcher</literal>
> + </para>
> + </listitem>

Hm. ISTM that we should not document the set of valid backend types as part of
this view. Couldn't we share it with pg_stat_activity.backend_type?

> + The difference is that reads done as part of <command>CREATE
> + DATABASE</command> are not counted in
> + <structname>pg_statio_all_tables</structname> and
> + <structname>pg_stat_database</structname>
> + </para>

Hm, this seems a bit far into the weeds?

> +Datum
> +pg_stat_get_io(PG_FUNCTION_ARGS)
> +{
> + PgStat_BackendIOContextOps *backends_io_stats;
> + ReturnSetInfo *rsinfo;
> + Datum reset_time;
> +
> + InitMaterializedSRF(fcinfo, 0);
> + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> +
> + backends_io_stats = pgstat_fetch_backend_io_context_ops();
> +
> + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> +
> + for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
> + {
> + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
> + bool expect_backend_stats = true;
> + PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
> +
> + /*
> + * For those BackendTypes without IO Operation stats, skip
> + * representing them in the view altogether.
> + */
> + expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
> + bktype);
> +
> + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> + {
> + const char *io_context_str = pgstat_io_context_desc(io_context);
> + PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
> +
> + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> + {
> + PgStat_IOOpCounters *counters = &io_objs->data[io_object];
> + const char *io_obj_str = pgstat_io_object_desc(io_object);
> +
> + Datum values[IO_NUM_COLUMNS] = {0};
> + bool nulls[IO_NUM_COLUMNS] = {0};
> +
> + /*
> + * Some combinations of IOContext, IOObject, and BackendType are
> + * not valid for any type of IOOp. In such cases, omit the
> + * entire row from the view.
> + */
> + if (!expect_backend_stats ||
> + !pgstat_bktype_io_context_io_object_valid((BackendType) bktype,
> + (IOContext) io_context, (IOObject) io_object))
> + {
> + pgstat_io_context_ops_assert_zero(counters);
> + continue;
> + }

Perhaps mention in a comment two loops up that we don't skip the nested loops
despite !expect_backend_stats because we want to assert here?

Greetings,

Andres Freund

In response to

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) at 2022-11-03 17:00:24 from Melanie Plageman

Responses

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) at 2022-11-29 02:05:33 from Melanie Plageman

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Isaac Morland	2022-11-21 01:24:11	Understanding WAL - large amount of activity from removing data
Previous Message	Heikki Linnakangas	2022-11-20 23:57:20	Re: Polyphase merge is obsolete