Re: should we enable log_checkpoints out of the box?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: should we enable log_checkpoints out of the box?
Date: 2021-11-02 15:55:23
Message-ID: CA+TgmoZ5ELtnPQxizzZLiq=rk6K2r3neve3mRk+SbwXiY_saeQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Oct 31, 2021 at 10:59 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The general policy at the moment is that a normally-functioning server
> should emit *no* log traffic by default (other than a few messages
> at startup and shutdown). log_checkpoints is a particularly poor
> candidate for an exception to that policy, because it would produce so
> much traffic. No DBA would be likely to consider it as anything but
> log spam.

That's absolutely false. On any system where there's anything actually
happening, there's going to be tons of stuff in the log because there
are going to be failed connection attempts, queries that result in
errors, and all kinds of other things like that. By any reasonable
metric, the log volume of log_checkpoints=on is tiny. It can't log
anything more than once per checkpoint interval, which means you're
not even talking about 1 message per minute. You can easily have
hundreds or thousands of errors or other log messages from user
activity *per second* and even on a relatively quiet system that stuff
is going to completely dwarf what you get from log_checkpoints. If
log_checkpoints=on is your biggest source of log output, what you need
isn't so much 'log_checkpoints=off' as 'pg_ctl stop'.

> > It seems the checkpoint stats, that are emitted to server logs when
> > the GUC log_checkpoints is enabled, are so important that a postgres
> > database provider would ever want to disable the GUC.
>
> This statement seems ridiculous on its face. If users need to wait
> with bated breath for a checkpoint report, we have something else
> we need to fix.

Besides appearing to be unwarranted mockery of what Bharath wrote,
this statement also seems to reflect a complete lack of understanding
of what is involved with maintaining a production system. Most of the
time, things just work, so you don't need to look at the logs at all.
But when things do go wrong, then you need some way to figure out what
the problem is. System-wide performance problems not linked to an
individual query are most often caused by either of two things:
checkpoints, and big autovacuum jobs. If you've set up logging for
those things in advance, then if and when the problem happens, you
will have a chance of being able to understand it and solve it
quickly. If you have not, the very first piece of advice you're going
to get from me is to (1) check whether there are any autovacuum
workers still running and if so figure out what they're doing and (2)
set log_checkpoints=on and log_autovacuum_min_duration to the smallest
value that isn't going to fill up your logs with too much garbage.

I can tell you from long experience with this sort of situation that
users do not love it. It means that they often cannot troubleshoot
problems of this type that happened in the past, because there's
simply no information in the log file that is of any use, and they
have to wait for the problem to recur... and I don't think it is
overstating anything to say that some of them probably DO wait with
bated breath, because they'd like whatever the issue is to get fixed!

I think shipping with log_checkpoints=on and
log_autovacuum_min_duration=10m or so would be one of the best things
we could possibly do to allow ex-post-facto troubleshooting of
system-wide performance issues. The idea that users care more about
the inconvenience of a handful of extra log messages than they do
about being able to find problems when they have them matches no part
of my experience. I suspect that many users would be willing to incur
*way more* useless log messages than those settings would ever
generate if it meant that they could actually find problems when and
if they have them. And these messages would in fact be the most
valuable thing in the log for a lot of users. What reasonable DBA
cares more about the fact that the application attempted an insert
that violated a foreign key constraint than they do about a checkpoint
that took 20 minutes to fsync everything? The former is expected; if
you thought that foreign key violations would never occur, you
wouldn't need to incur the expense of having the system enforce them.
The latter is unexpected and basically undiscoverable with default
settings.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gilles Darold 2021-11-02 15:56:10 Re: [PATCH] fix references to like_regex
Previous Message Tom Lane 2021-11-02 15:50:47 Re: [PATCH] fix references to like_regex