Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
Date: 2012-02-22 19:11:17
Message-ID: CAEYLb_XiYvpzrAUKG=NxJKR7KCzKHyhf0ubWMWZ+pfqO_eW7qA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 16 January 2012 06:28, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> One of the most useful bits of feedback on how well checkpoint I/O is going
> is the amount of time taken to sync files to disk.  Right now the only way
> to get that is to parse the logs.  The attached patch publishes the most
> useful three bits of data you could only get from log_checkpoints before out
> to pg_stat_bgwriter.  Easiest to just show an example:

This is a preliminary review of the patch. Attached is a revision of
the patch that clears up some minor bit-rot and style issues, just as
a convenience to everyone else who might like to try it out against
master who doesn't want to repeat my work. There were oid collisions
against master, and an upstream change in the pg_proc format with the
introduction of the "proleakproof" column. Regression tests pass (as
they must have in the prior revision, as the original authors modified
"expected/rules.out"). Note that I have deliberately avoided changing
the patch further than that, though I believe that further changes
will be necessary.

One beef that I have with the variable name m_write_ms is that "ms"
could equally well refer to microseconds or milliseconds, and these
mistakes are very common. Therefore, you might think about making the
fields of type "instr_time", which is consistent with some other such
fields used within the statistics collector - this takes care of
issues with overflows, portability and confusion over units, to as
great an extent as is possible. I'm not even confident that this is
the right thing though, because PgStat_StatFuncEntry, for example,
*does* use PgStat_Countuer as the type for time values, in its case
apparently *microseconds* rather than milliseconds, even though the
pg_stat_user_functions view displays values in millisecond (it's
incorrectly commented). Even the use of instr_time within some
pgstat.h structs sees the value converted to microseconds for
transmission to the collector. It might be that since we don't really
accumulate time values here, it's better to just directly represent
that value as PgStat_Counter, but in that case I'd still clearly mark
the value as being milliseconds. In any case, I would have the columns
in milliseconds, but not ending with *_ms within pg_stat_bgwriter,
for consistency with pg_stat_user_functions et al.

There is no reason to cast to uint64 when accumulating milliseconds,
as the lvalue is currently a PgStat_Counter.

Details of existing comment bug: The docs show that
pg_stat_*_functions accumulates time in units of milliseconds.
However, a PgStat_FunctionEntry payload is commented as sending the
time/self time to the stats collector in microseconds. So this
comment, in the existing code, is actually wrong:

PgStat_Counter f_time; /* times in microseconds */
PgStat_Counter f_time_self;
} PgStat_StatFuncEntry;

Can someone commit a fix for that please? It could easily result in a
user-visible bug. Mixing units, and representing them as simple
integers and totally eschewing type safety is sort of a pet peeve of
mine, but consistency with the existing code is more important.

My general impression of the patch's code quality is that it is more
or less idomatic, and, as Greg said, simple; it generalises from the
style of existing code in LogCheckpointEnd(), as well as existing
fields within the PgStat_MsgBgWriter and PgStat_GlobalStats structs .

As for the substance of the patch, I am in general agreement that this
is a good idea. Storing the statistics in pg_stat_bgwriter is a more
flexible approach than is immediately apparent. Users can usefully
calculate delta values, as part of the same process through which
checkpoint tuning is performed by comparing output from "select now(),
* from pg_stat_bgwriter" at different intervals. This also puts this
information within easy reach of monitoring tools.

So, I'd ask Greg and/or Jaime to produce a revision of the patch with
those concerns in mind, as well as fixing the md.c usage of
log_checkpoints. That variable is currently used in these places only:

src/backend/access/transam/xlog.c
80:bool log_checkpoints = false;
7694: * Note: because it is possible for log_checkpoints to change while a
7696: * log_checkpoints is currently off.
7835: if (log_checkpoints)
8035: if (log_checkpoints)
8229: * Note: because it is possible for log_checkpoints to change while a
8231: * log_checkpoints is currently off.
8236: if (log_checkpoints)
8301: if (log_checkpoints)
8305: ereport((log_checkpoints ? LOG : DEBUG2),

src/backend/storage/smgr/md.c
1097: if (log_checkpoints)
1105: if (log_checkpoints && (!INSTR_TIME_IS_ZERO(sync_start)))

I also agree with separating backend write times due to lagging
background writes, and the backend writes "by design", as described by
Jeff. I'm not inclined to worry too much about the usability issues
surrounding the relative values of things like the total checkpoint
time and total write and sync phase times matching up.

What do you intend to do about testing the log_checkpoints = 'off'
case? Can I help in some way? It seems pretty obvious to me that the
changes made are correct, apart from the fact that you omitted to
modify md.c . In particular, there are no pre or post conditions
involved with calling TimeCheckpointEnd() (formerly LogCheckpointEnd()
). This is evident from the following comment at transam/xlog.c:7709:

* Note: because it is possible for log_checkpoints to change while a
* checkpoint proceeds, we always accumulate stats, even if
* log_checkpoints is currently off.

It seems fairly silly to me to not expose this information if it's
invariably being collected anyway.

Cheers
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
publish-sync-stats-v2.patch text/x-patch 19.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2012-02-22 19:12:35 Re: VACUUM ANALYZE is faster than ANALYZE?
Previous Message Kevin Grittner 2012-02-22 18:57:12 Re: VACUUM ANALYZE is faster than ANALYZE?