Re: Track IO times in pg_stat_io

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, smilingsamay(at)gmail(dot)com, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Track IO times in pg_stat_io
Date: 2023-03-09 16:50:38
Message-ID: CAAKRu_YZroS-LqiQqWwbKyjX47kfmxvU2Oy6=bSu=TMj_xbFhw@mail.gmail.com
Lists: pgsql-hackers

Hi, the attached v4 addresses these review comments.

On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > > to also check that if counts are not Zero then times are not Zero.
> >
> > Yes, I think adding some validation around the relationship between
> > counts and timing should help prevent developers from forgetting to call
> > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> >
> > However, I think that we cannot check that if IO counts are non-zero
> > that IO times are non-zero, because the user may not have
> > track_io_timing enabled. We can check that if IO times are not zero, IO
> > counts are not zero. I've done this in the attached v3.
>
> And even if track_io_timing is enabled, the timer granularity might be so low
> that we *still* get zeroes.
>
> I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,

And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?

> > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >
> > if (isExtend)
> > {
> > + instr_time io_start,
> > + io_time;
> > +
> > /* new buffers are zero-filled */
> > MemSet((char *) bufBlock, 0, BLCKSZ);
> > +
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
>
> I wonder if there's an argument for tracking this in the existing IO stats as
> well. But I guess we've lived with this for a long time...

Not sure I want to include that in this patchset.

> > @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
> > * When a strategy is not in use, the write can only be a "regular" write
> > * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
> > */
> > - pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> > -
> > if (track_io_timing)
> > {
> > INSTR_TIME_SET_CURRENT(io_time);
> > INSTR_TIME_SUBTRACT(io_time, io_start);
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > + pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
> > }
>
> I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
> pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
> might not be able to reduce the whole duplication at this point, but at least
> it should be a bit more centralized.

So, in the attached v4, I've introduced pgstat_io_start() and
pgstat_io_end(...). The end IO function takes the IOObject, IOOp, and
IOContext, in addition to the start_time, so that we know which
pgBufferUsage field to increment and which pgstat_count_buffer_*_time()
to call.
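
Roughly, the shape is this (a sketch, not verbatim from the patch; the
dispatch to the pre-existing counters in v4 is a bit more involved):

instr_time
pgstat_io_start(void)
{
    instr_time  io_start;

    if (track_io_timing)
        INSTR_TIME_SET_CURRENT(io_start);
    else
        INSTR_TIME_SET_ZERO(io_start);

    return io_start;
}

void
pgstat_io_end(instr_time start_time, IOObject io_object,
              IOContext io_context, IOOp io_op)
{
    instr_time  io_time;

    if (!track_io_timing)
        return;

    INSTR_TIME_SET_CURRENT(io_time);
    INSTR_TIME_SUBTRACT(io_time, start_time);

    /* bump the pre-existing counters appropriate to this IOOp */
    if (io_op == IOOP_WRITE)
    {
        INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
        if (io_object == IOOBJECT_RELATION)
            pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
    }
    else if (io_op == IOOP_READ)
    {
        INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
        pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
    }

    pgstat_count_io_time(io_object, io_context, io_op, io_time);
}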

I will note that calling this function now causes pgBufferUsage and
pgStatBlock*Time to be incremented in a couple of places where they
were not incremented before. I think those might have been accidental
omissions, so I think it is okay.

The exception is that pgstat_count_buffer_write_time() is called only
for relations in shared buffers and not for temporary relations, while
pgstat_count_buffer_read_time() is called for both temporary relations
and relations in shared buffers. I left that behavior as is, though it
seems wrong.

I added pgstat_io_start() to pgstat.c -- not sure that is the best
place for it.

I could separate it into a commit that does this refactoring of the
existing counting (without adding pgstat_count_io_time()) and then
another that adds pgstat_count_io_time(). I hesitated to do that until I
knew that the new functions were viable.

> > + pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> > pgBufferUsage.shared_blks_written++;
> >
> > /*
> > @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
> >
> > if (RelationUsesLocalBuffers(rel))
> > {
> > + instr_time io_start,
> > + io_time;
> > +
> > for (i = 0; i < NLocBuffer; i++)
> > {
> > uint32 buf_state;
> > @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
> >
> > PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
> >
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
> > smgrwrite(RelationGetSmgr(rel),
> > BufTagGetForkNum(&bufHdr->tag),
> > bufHdr->tag.blockNum,
>
> I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to
> silence the compiler warnings you can do it one level up.

So, I didn't move it out, because I am using pgstat_io_start(), which
does the INSTR_TIME_SET_ZERO(). However, I could eschew the
pgstat_io_start() helper function and just inline its contents. Do you
think avoiding the SET_ZERO overhead is worth that?
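
For reference, the inlined alternative would hoist the zeroing out of
the loop, roughly like this (hypothetical, not what v4 does):

    instr_time  io_start;

    /* zero once, outside the loop, just to keep the compiler quiet */
    INSTR_TIME_SET_ZERO(io_start);

    for (i = 0; i < NLocBuffer; i++)
    {
        if (track_io_timing)
            INSTR_TIME_SET_CURRENT(io_start);

        /* ... write out the buffer, then measure from io_start ... */
    }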

> > @@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> >
> > PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
> >
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
> > /* And write... */
> > smgrwrite(oreln,
> > BufTagGetForkNum(&bufHdr->tag),
> > @@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> > buf_state &= ~BM_DIRTY;
> > pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
> >
> > + if (track_io_timing)
> > + {
> > + INSTR_TIME_SET_CURRENT(io_time);
> > + INSTR_TIME_SUBTRACT(io_time, io_start);
> > + pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
> > + }
> > +
> > pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
> > pgBufferUsage.local_blks_written++;
> > }
>
> Perhaps we can instead introduce a FlushLocalBuffer()? Then we don't need this
> in multiple write paths.

FlushLocalBuffer() is a good idea. It would be nice for it to contain
more than just

    pgstat_io_start()
    smgrwrite()
    pgstat_io_end()

e.g. to also include checksumming and the dirty-bit handling (more
like the normal FlushBuffer()). I noticed that LocalBufferAlloc() does
not set up error traceback support for ereport but
FlushRelationBuffers() does. Is this intentional?
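
Something along these lines, perhaps (a very rough sketch, glossing
over the error traceback question):

static void
FlushLocalBuffer(SMgrRelation smgr, BufferDesc *bufHdr)
{
    instr_time  io_start;
    Page        localpage = (char *) LocalBufHdrGetBlock(bufHdr);
    uint32      buf_state;

    PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

    io_start = pgstat_io_start();

    smgrwrite(smgr,
              BufTagGetForkNum(&bufHdr->tag),
              bufHdr->tag.blockNum,
              localpage,
              false);

    /* mark the buffer clean */
    buf_state = pg_atomic_read_u32(&bufHdr->state);
    buf_state &= ~BM_DIRTY;
    pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

    pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL,
                  IOOP_WRITE);
    pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL,
                       IOOP_WRITE);
    pgBufferUsage.local_blks_written++;
}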

> > diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> > index 352958e1fe..052875d86a 100644
> > --- a/src/backend/storage/smgr/md.c
> > +++ b/src/backend/storage/smgr/md.c
> > @@ -1030,6 +1030,30 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
> >
> > if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
> > {
> > + instr_time io_start,
> > + io_time;
> > +
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
> > + ereport(DEBUG1,
> > + (errmsg_internal("could not forward fsync request because request queue is full")));
> > +
> > + if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> > + ereport(data_sync_elevel(ERROR),
> > + (errcode_for_file_access(),
> > + errmsg("could not fsync file \"%s\": %m",
> > + FilePathName(seg->mdfd_vfd))));
> > +
> > + if (track_io_timing)
> > + {
> > + INSTR_TIME_SET_CURRENT(io_time);
> > + INSTR_TIME_SUBTRACT(io_time, io_start);
> > + pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
> > + }
> > +
> > /*
> > * We have no way of knowing if the current IOContext is
> > * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
> > @@ -1042,15 +1066,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
> > * backend fsyncs.
> > */
> > pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
> > -
> > - ereport(DEBUG1,
> > - (errmsg_internal("could not forward fsync request because request queue is full")));
> > -
> > - if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> > - ereport(data_sync_elevel(ERROR),
> > - (errcode_for_file_access(),
> > - errmsg("could not fsync file \"%s\": %m",
> > - FilePathName(seg->mdfd_vfd))));
> > }
> > }
> >
> > @@ -1399,6 +1414,8 @@ int
> > mdsyncfiletag(const FileTag *ftag, char *path)
> > {
> > SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
> > + instr_time io_start,
> > + io_time;
> > File file;
> > bool need_to_close;
> > int result,
> > @@ -1425,10 +1442,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
> > need_to_close = true;
> > }
> >
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
> > /* Sync the file. */
> > result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
> > save_errno = errno;
> >
> > + if (track_io_timing)
> > + {
> > + INSTR_TIME_SET_CURRENT(io_time);
> > + INSTR_TIME_SUBTRACT(io_time, io_start);
> > + pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
> > + }
> > +
> > if (need_to_close)
> > FileClose(file);
>
> Perhaps we could have mdsyncfd(), used by both mdsyncfiletag() and
> register_dirty_segment()?

I agree it would be nice, but it seems like it would take a bit of
work and might not be worth doing in this patchset.
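
If we did do it, I'd imagine the shared guts looking roughly like this
(hypothetical sketch; mdsyncfd() and its callers are not in v4):

static int
mdsyncfd(File file)
{
    instr_time  io_start;
    int         result;
    int         save_errno;

    io_start = pgstat_io_start();

    result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
    save_errno = errno;

    pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
                  IOOP_FSYNC);
    pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);

    errno = save_errno;
    return result;
}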

> > @@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> >
> > for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> > {
> > - int col_idx = pgstat_get_io_op_index(io_op);
> > + int i = pgstat_get_io_op_index(io_op);
> >
> > /*
> > * Some combinations of BackendType and IOOp, of IOContext
> > * and IOOp, and of IOObject and IOOp are not tracked. Set
> > * these cells in the view NULL.
> > */
> > - nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
> > + if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
> > + values[i] = Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
> > + else
> > + nulls[i] = true;
> > + }
>
> These lines were already too long, and it's getting worse with this change.

I've started using local variables.
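
The loop now reads something like this (sketch):

    for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
    {
        int         col_idx = pgstat_get_io_op_index(io_op);
        PgStat_Counter count =
            bktype_stats->counts[io_obj][io_context][io_op];

        /*
         * Some combinations of BackendType, IOObject, IOContext, and
         * IOOp are not tracked. Leave those cells in the view NULL.
         */
        if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
            values[col_idx] = Int64GetDatum(count);
        else
            nulls[col_idx] = true;
    }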

> > typedef struct PgStat_BktypeIO
> > {
> > - PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > + PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > + instr_time times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > } PgStat_BktypeIO;
>
> Ah, you're going to hate me. We can't store instr_time on disk. There's
> another patch that gets substantial performance gains by varying the frequency
> at which instr_time keeps track of time based on the CPU frequency...

What does that have to do with what we can store on disk?

If the problem is just the on-disk representation, would it not be
enough to do something like this when reading/writing the stats file?

void
instr_time_deserialize(instr_time *dest, int64 *src, int length)
{
    for (int i = 0; i < length; i++)
    {
        /* assumes instr_time ticks are nanoseconds, as on Linux */
        INSTR_TIME_SET_ZERO(dest[i]);
        dest[i].ticks = src[i];
    }
}

void
instr_time_serialize(int64 *dest, instr_time *src, int length)
{
    for (int i = 0; i < length; i++)
        dest[i] = INSTR_TIME_GET_NANOSEC(src[i]);
}
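
i.e. flatten the times array into int64 nanoseconds on write and
rebuild it on read, along the lines of (hypothetical call sites; the
names and the array-flattening casts are just for illustration):

    int64       times_on_disk[IOOBJECT_NUM_TYPES *
                              IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES];

    /* writing the stats file */
    instr_time_serialize(times_on_disk,
                         (instr_time *) bktype_io->times,
                         lengthof(times_on_disk));

    /* reading it back in */
    instr_time_deserialize((instr_time *) bktype_io->times,
                           times_on_disk,
                           lengthof(times_on_disk));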

> It also just doesn't have enough range to keep track of system wide
> time on a larger system. A single backend won't run for 293 years, but
> with a few thousand backends that's a whole different story.
>
> I think we need to accumulate in instr_time, but convert to floating point
> when flushing stats.

Hmmm. So, are you saying that we need to read from disk when we query
the view and add that to what is in shared memory? That we only store
the delta since the last restart in the instr_time array?

But I don't see how that avoids the overflow problem. 2^63
nanoseconds is the ~293 years in question, so a few thousand
fully-busy backends accumulating into a single shared counter could
overflow it within about a month of wall-clock time. We would have to
reset and write out the delta whenever we thought the times would
overflow.

But maybe I am misunderstanding something.

- Melanie

Attachment Content-Type Size
v4-0001-Track-IO-times-in-pg_stat_io.patch text/x-patch 25.8 KB
