Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date: 2022-02-09 16:00:06
Message-ID: CA+TgmoYst3+34ANsDUO__Ydd3tt7y7-ob+0rU-iurgjg2xQkig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 9, 2022 at 9:19 AM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Honestly, I never understood why the checkpoint during CREATE DATABASE
> was as problem --- we checkpoint by default every five minutes anyway,
> so why is an additional two a problem --- it just means the next
> checkpoint will do less work. It is hard to see how avoiding
> checkpoints to add WAL writes, fscyncs, and replication traffic could be
> a win.

Try running pgbench with the --progress option and enough concurrent
jobs to keep a moderately large system busy and watching what happens
to the tps each time a checkpoint occurs. It's extremely dramatic, or
at least it was the last time I ran such tests. I think that
performance will sometimes drop by a factor of five or more when the
checkpoint hits, and take multiple minutes to recover.

I think your statement that doing an extra checkpoint "just means the
next checkpoint will do less work" is kind of misleading. That's
certainly true in some situations. But when the same pages are being
dirtied over and over again, an extra checkpoint often means that the
system will do MUCH MORE work, because every checkpoint triggers a new
set of full-page writes over the actively-updated portion of the
database.

I think that very few people run systems with heavy write workloads
with checkpoint_timeout=5m, precisely because of this issue. Almost
every system I see has had that raised to at least 10m and sometimes
30m or more. It can make a massive difference.

> I see the patch justification outlined here:
>
> https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com
>
> TDE is mentioned as a value for this patch, but I don't see why it is
> needed --- TDE can easily decrypt/encrypt the pages while they are
> copied.

That's true, but depending on what other design decisions we make,
WAL-logging it might be a problem.

Right now, when someone creates a new database, we log a single record
that basically says "go copy the directory'". That's very different
than what we normally do, which is to log changes to individual pages,
or where required, small groups of pages (e.g. a single WAL record is
written for an UPDATE even though it may touch two pages). The fact
that in this case we only log a single WAL record for an operation
that could touch an unbounded amount of data is why this needs special
handling around checkpoints. It also introduces a certain amount of
fragility into the system, because if for some reason the source
directory on the standby doesn't exactly match the source directory on
the primary, the new databases won't match either. Any errors that
creep into the process can be propagated around to other places by a
system like this. However, ordinarily that doesn't happen, which is
why we've been able to use this system successfully for so many years.

The other reason we've been able to use this successfully is that
we're confident that we can perform exactly the same operation on the
standby as we do on the primary knowing only the relevant directory
names. If we say "copy this directory to there" we believe we'll be
able to do that exactly the same way on the standby. Is that still
true with TDE? Well, it depends. If the encryption can be performed
knowing only the key and the identity of the block (database OID,
tablespace OID, relfilenode, fork, block number) then it's true. But
if the encryption needs to, for example, generate a random nonce for
each block, then it's false. If you want the standby to be an exact
copy of the master in a system where new blocks get random nonces,
then you need to replicate the copy block-by-block, not as one
gigantic operation, so that you can log the nonce you picked for each
block. On the other hand, maybe you DON'T want the standby to be an
exact copy of the master. If, for example, you imagine a system where
the master and standby aren't even using the same key, then this is a
lot less relevant.

I can't predict whether PostgreSQL will get TDE in the future, and if
it does, I can't predict what form it will take. Therefore any strong
statement about whether this will benefit TDE or not seems to me to be
pretty questionable - we don't know that it will be useful, and we
don't know that it won't. But, like Dilip, I think the way we're
WAL-logging CREATE DATABASE right now is a hack, and I *know* it can
cause massive performance drops on busy systems.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-02-09 16:01:40 Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Previous Message Dilip Kumar 2022-02-09 15:58:52 Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints