Re: Track IO times in pg_stat_io

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, smilingsamay(at)gmail(dot)com, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Track IO times in pg_stat_io
Date: 2023-03-14 22:56:15
Message-ID: 20230314225615.4twkbpcl6mtini4a@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > > > to also check that if counts are not Zero then times are not Zero.
> > >
> > > Yes, I think adding some validation around the relationship between
> > > counts and timing should help prevent developers from forgetting to call
> > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > >
> > > However, I think that we cannot check that if IO counts are non-zero
> > > that IO times are non-zero, because the user may not have
> > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > counts are not zero. I've done this in the attached v3.
> >
> > And even if track_io_timing is enabled, the timer granularity might be so low
> > that we *still* get zeroes.
> >
> > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
>
> And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> stats?

Yes.

> > > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> > >
> > > if (isExtend)
> > > {
> > > + instr_time io_start,
> > > + io_time;
> > > +
> > > /* new buffers are zero-filled */
> > > MemSet((char *) bufBlock, 0, BLCKSZ);
> > > +
> > > + if (track_io_timing)
> > > + INSTR_TIME_SET_CURRENT(io_start);
> > > + else
> > > + INSTR_TIME_SET_ZERO(io_start);
> > > +
> >
> > I wonder if there's an argument for tracking this in the existing IO stats as
> > well. But I guess we've lived with this for a long time...
>
> Not sure I want to include that in this patchset.

No, probably not.

> > > typedef struct PgStat_BktypeIO
> > > {
> > > - PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > + PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > + instr_time times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > } PgStat_BktypeIO;
> >
> > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > another patch that gets substantial peformance gains by varying the frequency
> > at which instr_time keeps track of time based on the CPU frequency...
>
> What does that have to do with what we can store on disk?

The frequency can change.

> If so, would it not be enough to do this when reading/writing the stats
> file?

Theoretically yes. But to me it seems cleaner to do it when flushing to shared
stats. See also the overflow issue below.

> void
> instr_time_deserialize(instr_time *dest, int64 *src, int length)
> {
> for (size_t i = 0; i < length; i++)
> {
> INSTR_TIME_SET_ZERO(dest[i]);
> dest[i].ticks = src[i];
> }
> }

That wouldn't be correct, because what ticks means will at some point change
between postgres stopping and starting.

> > It also just doesn't have enough range to keep track of system wide
> > time on a larger system. A single backend won't run for 293 years, but
> > with a few thousand backends that's a whole different story.
> >
> > I think we need to accumulate in instr_time, but convert to floating point
> > when flushing stats.
>
> Hmmm. So, are you saying that we need to read from disk when we query
> the view and add that to what is in shared memory? That we only store
> the delta since the last restart in the instr_time array?

No, I don't think I am suggesting that. What I am trying to suggest is that
PendingIOStats should contain instr_time, but that PgStat_IO should contain
PgStat_Counter as microseconds, as before.

> But, I don't see how that avoids the problem of backend total runtime
> being 293 years. We would have to reset and write out the delta whenever
> we thought the times would overflow.

The overflow risk is due to storing nanoseconds (which is what instr_time
stores internally on linux now) - which we don't need to do once
accumulatated. Right now we store them as microseconds.

nanosecond range:
((2**63) - 1)/(10**9*60*60*24*365) -> 292 years
microsecond range:
((2**63) - 1)/(10**6*60*60*24*365) -> 292471 years

If you assume 5k connections continually doing IO, a range of 292 years would
last 21 days at nanosecond resolution. At microsecond resolution it's 58
years.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2023-03-14 22:56:50 Re: Add pg_walinspect function with block info columns
Previous Message Melanie Plageman 2023-03-14 22:34:09 Re: Add pg_walinspect function with block info columns