Pgbackrest failure for INCR and DIFF but not FULL backup

From: KK CHN <kkchn(dot)in(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Pgbackrest failure for INCR and DIFF but not FULL backup
Date: 2025-05-07 11:17:29
Message-ID: CAKgGyB-8ap3-zna_gkpEQNfzgWzu4QwO5Y6U_-2q1JcjK=UTvw@mail.gmail.com
Lists: pgsql-general

Hi folks,

I am facing a strange issue: pgBackRest backups fail for DIFF and INCR backups but not for FULL backups, with the error "WAL file cannot be archived before 60000 ms timeout".

The pgbackrest "check" command against the stanza sometimes succeeds and sometimes fails.

I don't know why PostgreSQL is unable to copy WAL files from pg_wal to /data/archive in real time. I always observe a delay of a few minutes before a WAL file from pg_wal appears in /data/archive.
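
For context, the archiver status can be checked on the primary with the standard pg_stat_archiver view; the psql invocation below is only a sketch:

    psql -x -c "SELECT archived_count, last_archived_wal, last_archived_time,
                       failed_count, last_failed_wal, last_failed_time
                FROM pg_stat_archiver;"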

I have observed in postgresql.conf: checkpoint_timeout = 5min, max_wal_size = 16GB, wal_keep_size = 15GB, min_wal_size = 80MB, etc.
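
For reference, the relevant postgresql.conf lines look roughly like this (values as listed above; the archive_command is quoted further down in this mail, and archive_mode = on is an assumption, since archiving is clearly active):

    archive_mode = on        # assumed, since WAL archiving is running
    archive_command = 'pgbackrest --stanza=My_Repo archive-push %p && cp %p /data/archive/%f'
    checkpoint_timeout = 5min
    max_wal_size = 16GB
    min_wal_size = 80MB
    wal_keep_size = 15GB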

I also found the pg_wal directory is 16GB on disk (du -h -d 1) and /data/archive is 1.2T:

/dev/mapper/rhel_bcga68-data 5.0T 1.8T 3.3T 35% /data

Can we suspect that the 5-minute checkpoint_timeout is the reason for the WAL archiving delay?

An INCR or DIFF backup to the remote repo server always fails, while a full backup always succeeds!

What is the ideal value to set for "checkpoint_timeout"? Or does it have no impact on the pgbackrest failures?

archive_command = 'pgbackrest --stanza=My_Repo archive-push %p && cp %p /data/archive/%f'
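
Based on the options visible in the archive-push log lines below, the pgbackrest.conf on the database host presumably contains roughly the following; the archive-timeout line is my assumption only (60 seconds is the default, which matches the 60-second error):

    [global]
    repo1-host=10.x.y.202
    repo1-host-user=pgbackrest
    archive-async=y
    spool-path=/var/spool/pgbackrest
    process-max=6
    compress-type=zst
    log-level-console=info
    log-level-file=debug
    # archive-timeout=120   # assumption: not currently set; raising it past the 60s default is one knob to test

    [My_Repo]
    pg1-path=/var/lib/postgres/16/data
    pg-version-force=16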

In the PostgreSQL logs I am seeing this:

ERROR: [082]: unable to push WAL file '000000010000026300000002' to the
archive asynchronously after 60 second(s)
HINT: check '/var/log/pgbackrest/My_Repo-archive-push-async.log' for
errors.
INFO: archive-push command end: aborted with exception [082]
2025-05-02 12:15:17 IST LOG: archive command failed with exit code 82
2025-05-02 12:15:17 IST DETAIL: The failed archive command was: pgbackrest
--stanza=My_Repo archive-push pg_wal/000000010000026300000002 && cp
pg_wal/000000010000026300000002 /data/archive/000000010000026300000002
INFO: archive-push command begin 2.52.1: [pg_wal/000000010000026300000002]
--archive-async --compress-type=zst --exec-id=2848559-384cf49c
--log-level-console=info --log-level-file=debug --log-level-stderr=info
--pg1-path= /var/lib/postgres/16/data --pg-version-force=16
--process-max=6 --repo1-host=10.x.y.202 --repo1-host-user=pgbackrest
--spool-path=/var/spool/pgbackrest --stanza=My_Repo
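
As the HINT suggests, the async log on the database host can be inspected directly, for example:

    tail -n 100 /var/log/pgbackrest/My_Repo-archive-push-async.log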

top output on DB cluster:

top - 12:37:00 up 66 days, 17:24, 2 users, load average: 4.04, 4.72, 4.56

Tasks: 902 total, 4 running, 897 sleeping, 0 stopped, 1 zombie
%Cpu(s): 7.4 us, 1.7 sy, 0.0 ni, 89.9 id, 0.4 wa, 0.2 hi, 0.4 si,
0.0 st
MiB Mem : 31837.6 total, 706.1 free, 15243.0 used, 24741.0 buff/cache
MiB Swap: 8060.0 total, 6634.0 free, 1426.0 used. 16608.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
2839363 postgre+  20   0 8965608   7.2g   7.1g S  70.2  23.0  2:02.61 postgres
2864108 postgre+  20   0 8967848   7.1g   7.1g S  64.9  22.8  0:30.04 postgres
2865547 postgre+  20   0 8965432   7.1g   7.1g S  39.1  22.8  0:32.30 postgres
2865752 postgre+  20   0 8964352   6.9g   6.9g S  16.6  22.3  0:32.94 postgres

Model name: Intel(R) Xeon(R) Gold 6430
BIOS Model name: Intel(R) Xeon(R) Gold 6430
CPU family: 6
Model: 143
Thread(s) per core: 1
Core(s) per socket: 16

These are vCPUs (16 of them); OS is RHEL 9, PostgreSQL 16, pgbackrest 2.52.1.

Any hints to help find the root cause / troubleshoot the pgbackrest failures for DIFF/INCR backups are most welcome.

Thank you
Krishane

For more input: you can see the full backup success below, but a diff backup fails.

full backup: 20250505-070204F
    timestamp start/stop: 2025-05-05 07:02:04+05:30 / 2025-05-05 22:11:23+05:30
    wal start/stop: 000000010000026F00000066 / 000000010000027300000045
    database size: 503.1GB, database backup size: 503.1GB
    repo1: backup size: 79GB

When I try a diff backup, it always fails.
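
The diff backup is started on the repo server roughly like this (the --log-level-console=detail part is only there to get more output on the console; the base command is the usual pgbackrest invocation):

    pgbackrest --stanza=My_Repo --type=diff --log-level-console=detail backup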

[root(at)Repo ~]# tail -f /var/log/pgbackrest/My_Repo-backup.log

stack trace:

command/archive/find.c:walSegmentFind:191:(this: {WalSegmentFind},
walSegment: {"000000010000027B0000006C"})

command/backup/backup.c:backupArchiveCheckCopy:(backupData: {BackupData},
manifest: {Manifest})
command/backup/backup.c:cmdBackup:(void)
main.c:main:(debug log level required
for parameters)

--------------------------------------------------------------------
2025-05-07 15:47:49.760 P00 INFO: backup command end: aborted with
exception [082]
2025-05-07 15:47:49.760 P00 DEBUG: command/exit::exitSafe: => 82
2025-05-07 15:47:49.860 P00 DEBUG: main::main: => 82
^C
[ root(at)Repo ~ ~]# date
Wednesday 07 May 2025 04:06:37 PM IST

The PostgreSQL log says:

=316781-61d82f85 --log-level-console=info --log-level-file=debug
--log-level-stderr=info --pg1-path=
/var/lib/postgres/16/data --pg-version-force=16 --process-max=3
--repo1-host=10.x.y.202 --repo1-host-user=pgbackrest
--spool-path=/var/spool/pgbackrest --stanza=My_Repo
INFO: pushed WAL file '000000010000027B00000082' to the archive
asynchronously
INFO: archive-push command end: completed successfully (37506ms)
INFO: archive-push command begin 2.52.1: [pg_wal/000000010000027B00000083]
--archive-async --compress-type=zst --exec-id=317334-27a30b57
--log-level-console=info --log-level-file=debug --log-level-stderr=info
--pg1-path= /var/lib/postgres/16/data --pg-version-force=16
--process-max=3 --repo1-host=10.x.y.202 --repo1-host-user=pgbackrest
--spool-path=/var/spool/pgbackrest --stanza=My_Repo
ERROR: [082]: unable to push WAL file '000000010000027B00000083' to the
archive asynchronously after 60 second(s)
HINT: check '/var/log/pgbackrest/My_Repo-archive-push-async.log' for
errors.
INFO: archive-push command end: aborted with exception [082]
2025-05-07 16:22:56 IST LOG: archive command failed with exit code 82

[ root(at)Repo ~ ~]#
