Force the old transactions logs cleanup even if checkpoint is skipped

From: "Zakhlystov, Daniil (Nebius)" <usernamedt(at)nebius(dot)com>
To: "amborodin(at)acm(dot)org" <amborodin(at)acm(dot)org>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: "Mokrushin Mikhail (Nebius)" <rodrijjke(at)nebius(dot)com>
Subject: Force the old transactions logs cleanup even if checkpoint is skipped
Date: 2023-10-17 14:09:21
Message-ID: AM9P190MB12346310F38B3FAF9287D1FFB5D6A@AM9P190MB1234.EURP190.PROD.OUTLOOK.COM
Lists: pgsql-hackers


Hi, hackers!

I've stumbled into an interesting problem. Currently, if Postgres has nothing to write, it skips the checkpoint that checkpoint_timeout would otherwise trigger. However, a temporary archiving problem (for example, a network issue) can leave a pile of WAL files stuck in pg_wal. Once the issue is resolved, the files do get archived, but they are still not removed, because old WAL is removed as part of a checkpoint and we keep skipping the checkpoint since there is nothing to write.

This can become a real problem: suppose you run out of disk space because of a temporary archiver failure. After the failure is resolved, Postgres cannot recover from it automatically and requires human attention to issue a CHECKPOINT.

I suggest changing this behavior so that we still try to clean up old WAL even when the main checkpoint routine is skipped. I've attached a patch that does exactly that.

What do you think?

To reproduce the issue, follow these steps:

1. Init Postgres:
pg_ctl initdb -D /Users/usernamedt/test_archiver

2. Add the archiver script to simulate failure:
➜  ~ cat /Users/usernamedt/command.sh
#!/bin/bash

false

3. Then edit postgresql.conf:

archive_mode = on
checkpoint_timeout = 30s
archive_command = '/Users/usernamedt/command.sh'
log_min_messages = debug1

4. Then start Postgres:
/usr/local/pgsql/bin/pg_ctl -D /Users/usernamedt/test_archiver -l logfile start

5. Insert some data:
pgbench -i -s 30 -d postgres

6. Trigger checkpoint to flush all data:
psql -c "checkpoint;"

7. Alter the archiver script to simulate the end of archiver issues:
➜  ~ cat /Users/usernamedt/command.sh
#!/bin/bash

true

8. Check that the WAL files are actually archived but not removed:
➜  ~ ls -lha /Users/usernamedt/test_archiver/pg_wal/archive_status | head
total 0
drwx------@ 48 usernamedt  LD\Domain Users   1.5K Oct 17 17:44 .
drwx------@ 50 usernamedt  LD\Domain Users   1.6K Oct 17 17:43 ..
-rw-------@  1 usernamedt  LD\Domain Users     0B Oct 17 17:42 000000010000000000000040.done
...
-rw-------@  1 usernamedt  LD\Domain Users     0B Oct 17 17:43 00000001000000000000006D.done

The server log, meanwhile, shows that the timed checkpoints are being skipped:
2023-10-17 18:03:44.621 +04 [71737] DEBUG:  checkpoint skipped because system is idle
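
For anyone repeating step 8, here is a minimal sketch of a helper that counts the status files instead of eyeballing the ls output. The function name count_status_files is hypothetical (it is not part of PostgreSQL or of the attached patch), and the directory in the commented example is simply the test cluster path used above.

```shell
#!/bin/bash
# Hypothetical helper: count files in an archive_status directory
# that end with the given suffix ("ready" or "done").
count_status_files() {
    local status_dir="$1" suffix="$2"
    # -maxdepth 1: look only at the directory itself, not subdirectories.
    # tr strips the space padding that some wc implementations add.
    find "$status_dir" -maxdepth 1 -type f -name "*.${suffix}" | wc -l | tr -d ' '
}

# Example (assumes the test cluster from the steps above):
# STATUS_DIR=/Users/usernamedt/test_archiver/pg_wal/archive_status
# echo "ready: $(count_status_files "$STATUS_DIR" ready)"
# echo "done:  $(count_status_files "$STATUS_DIR" done)"
```

A large "done" count that does not shrink while the log keeps reporting "checkpoint skipped because system is idle" matches the state described here. As noted earlier, it currently takes a manual CHECKPOINT (as in step 6) to get those segments removed.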

Thanks,

Daniil Zakhlystov

Attachment Content-Type Size
0001-Cleanup-old-files-if-checkpoint-is-skipped.patch application/octet-stream 1.3 KB
