Re: A few new options for CHECKPOINT

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A few new options for CHECKPOINT
Date: 2020-11-28 01:50:54
Message-ID: A07F8329-4870-40B2-9991-5016491AA11A@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/27/20, 10:58 AM, "Stephen Frost" <sfrost(at)snowman(dot)net> wrote:
> If you'd like to show that I'm wrong, and it's entirely possible that I
> am, then retry the above with actual load on the system, and also
> actually look at how much outstanding WAL you end up with given the
> different scenarios which has to be replayed during crash recovery.

I did a little experiment to show the behavior I'm referring to. I
used these settings:

checkpoint_completion_target = 0.9
checkpoint_timeout = 30s
max_wal_size = 20GB
WAL segment size is 64MB

I ran the following pgbench command for a few minutes before each
test:

pgbench postgres -T 3600 -c 64 -j 64 -N

For the first test, I killed Postgres just before an automatic, non-
immediate checkpoint completed.

2020-11-28 00:31:57 UTC::@:[51770]:LOG: checkpoint complete...
2020-11-28 00:32:00 UTC::@:[51770]:LOG: checkpoint starting: time

Killed Postgres at 00:32:26 UTC, 29 seconds after latest
checkpoint completed.

2020-11-28 00:32:42 UTC::@:[77256]:LOG: redo starts at 3CF/FD6B8BD0
2020-11-28 00:32:56 UTC::@:[77256]:LOG: redo done at 3D0/C94D1D00

Recovery took 14 seconds and replayed ~3.2 GB of WAL.

postgres=> SELECT pg_wal_lsn_diff('3D0/C94D1D00', '3CF/FD6B8BD0');
pg_wal_lsn_diff
-----------------
3420557616
(1 row)

For the second test, I killed Postgres just after an automatic, non-
immediate checkpoint completed.

2020-11-28 00:41:26 UTC::@:[77475]:LOG: checkpoint complete...

Killed Postgres at 00:41:26 UTC, just after latest checkpoint
completed.

2020-11-28 00:41:42 UTC::@:[8599]:LOG: redo starts at 3D3/152EDD78
2020-11-28 00:41:49 UTC::@:[8599]:LOG: redo done at 3D3/78358A40

Recovery took 7 seconds and replayed ~1.5 GB of WAL.

postgres=> SELECT pg_wal_lsn_diff('3D3/78358A40', '3D3/152EDD78');
pg_wal_lsn_diff
-----------------
1661381832
(1 row)

Granted, I used a rather aggressive checkpoint_timeout, but I think
this demonstrates that waiting for a non-immediate checkpoint to
complete can lower the amount of WAL needed for recovery, even though
it might not lower it as much as waiting for an immediate checkpoint
would.

Nathan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2020-11-28 02:10:40 Re: POC: postgres_fdw insert batching
Previous Message Tomas Vondra 2020-11-28 01:37:17 Re: [PoC] Non-volatile WAL buffer