BUG #16029: pg_basebackup - At least one WAL file is missing

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: silentx(at)arcor(dot)de
Subject: BUG #16029: pg_basebackup - At least one WAL file is missing
Date: 2019-09-27 12:11:41
Message-ID: 16029-cc360b93eda43c60@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16029
Logged by: Georg H.
Email address: silentx(at)arcor(dot)de
PostgreSQL version: 11.5
Operating system: CentOS Linux release 7.7.1908
Description:

Hello,

I've upgraded 2 clusters (single, no HA / replication except BARMANs
synchronous WAL streaming) as described below from v10.10 to v11.5.
Both clusters are on the same server, using same file systems, having
identical postgresql.conf settings (except port / cluster name / folders)
and have more or less same content (2 development stages)
We take sequential backups of both clusters using BARMAN with backup_method
= postgres (If set to postgres barman will use the pg_basebackup command to
execute the backup).
BARMAN conf for both clusters is also identical and using same PostgreSQL
11.5 binaries out of the same repo as the DB-server uses.

Since the upgrade the backup of one of both clusters is failing 2-3 times a
week with "error=At least one WAL file is missing. The first missing WAL
file is xxx".
In fact the "missing" WAL is archived to the backup server 2 seconds after
"pg_stop_backup complete" (based on postgresql.log).
The second cluster doesn't have this problem at all.
I'm not sure if this is a bug or just a configuration problem.

Noticeable is a difference between backup.info file (end_wal) and backup
history file (STOP WAL LOCATION) contained in the basebackup.
When failing, the backup history file is the one of STOP WAL LOCATION and
the next WAL is in the error marked as missing as it's set as end_wal in
backup.info
e.g. (see full content below):
backup history file is
000000010000000100000083.00000028.backup saying STOP WAL LOCATION:
1/83000130 (file 000000010000000100000083)
and missing WAL = end_wal = 000000010000000100000084

I would expect, that "end_wal = STOP WAL LOCATION", so pointing to the same
WAL as it does when backup succeed.

--backup.info
backup_label=None
begin_offset=40
begin_time=2019-09-26 23:35:02+02:00
begin_wal=000000010000000100000083
begin_xlog=1/83000028
config_file=/pg_data/data/i02mspdev1/postgresql.conf
copy_stats={'total_time': 23.154973, 'copy_time': 23.154973}
deduplicated_size=2038909210
end_offset=96
end_time=2019-09-26 23:35:25.052851+02:00
end_wal=000000010000000100000084
end_xlog=1/84000060
error=At least one WAL file is missing. The first missing WAL file is
000000010000000100000084
hba_file=/pg_data/data/i02mspdev1/pg_hba.conf
ident_file=/pg_data/data/i02mspdev1/pg_ident.conf
included_files=None
mode=postgres
pgdata=/pg_data/data/i02mspdev1
server_name=dev_server_5433
size=2038909210
status=FAILED
tablespaces=[('indexspace', 16400, '/pg_data/indexts/i02mspdev1')]
timeline=1
version=110005
xlog_segment_size=16777216
--backup.info end

--backup history
START WAL LOCATION: 1/83000028 (file 000000010000000100000083)
STOP WAL LOCATION: 1/83000130 (file 000000010000000100000083)
CHECKPOINT LOCATION: 1/83000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2019-09-26 23:35:02 CEST
LABEL: pg_basebackup base backup
START TIMELINE: 1
STOP TIME: 2019-09-26 23:35:21 CEST
STOP TIMELINE: 1
--backup history end

--excerpt out of postgresql.log (Level DEBUG1) :
2019-09-26 23:35:01.732 CEST [166017] [unknown] streaming_barman 10.x.x.101
LOG: received replication command: IDENTIFY_SYSTEM
2019-09-26 23:35:02.010 CEST [166018] [unknown] streaming_barman 10.x.x.101
LOG: received replication command: SHOW data_directory_mode
2019-09-26 23:35:02.011 CEST [166018] [unknown] streaming_barman 10.x.x.101
LOG: received replication command: SHOW wal_segment_size
2019-09-26 23:35:02.023 CEST [166018] [unknown] streaming_barman 10.x.x.101
LOG: received replication command: IDENTIFY_SYSTEM
2019-09-26 23:35:02.038 CEST [166018] [unknown] streaming_barman 10.x.x.101
LOG: received replication command: BASE_BACKUP LABEL 'pg_basebackup base
backup' FAST
2019-09-26 23:35:02.057 CEST [97464] DEBUG: performing replication slot
checkpoint
2019-09-26 23:35:05.090 CEST [97468] DEBUG: archived write-ahead log
file "000000010000000100000082"
2019-09-26 23:35:13.901 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: file "pg_internal.init" excluded from backup
2019-09-26 23:35:13.915 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: contents of directory "pg_dynshmem" excluded from backup
2019-09-26 23:35:13.915 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: contents of directory "pg_notify" excluded from backup
2019-09-26 23:35:13.915 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: contents of directory "pg_serial" excluded from backup
2019-09-26 23:35:13.915 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: contents of directory "pg_snapshots" excluded from backup
2019-09-26 23:35:13.915 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: contents of directory "pg_subtrans" excluded from backup
2019-09-26 23:35:13.916 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: file "pg_internal.init" excluded from backup
2019-09-26 23:35:14.060 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: file "pg_internal.init" excluded from backup
2019-09-26 23:35:14.120 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: file "pg_internal.init" excluded from backup
2019-09-26 23:35:17.744 CEST [166068] DEBUG: autovacuum: processing
database "postgres"
2019-09-26 23:35:21.325 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: contents of directory "pg_replslot" excluded from backup
2019-09-26 23:35:21.325 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: contents of directory "pg_stat_tmp" excluded from backup
2019-09-26 23:35:21.421 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: file "postmaster.opts" excluded from backup
2019-09-26 23:35:21.421 CEST [166018] [unknown] streaming_barman 10.x.x.101
DEBUG: file "postmaster.pid" excluded from backup
2019-09-26 23:35:22.682 CEST [97468] DEBUG: archived write-ahead log
file "000000010000000100000083"
2019-09-26 23:35:23.569 CEST [97468] DEBUG: archived write-ahead log
file "000000010000000100000083.00000028.backup"
2019-09-26 23:35:24.449 CEST [166018] [unknown] streaming_barman 10.x.x.101
NOTICE: pg_stop_backup complete, all required WAL segments have been
archived
2019-09-26 23:35:25.547 CEST [166098] postgres barman 10.x.x.101 LOG:
restore point "barman_20190926T233501" created at 1/85000090
2019-09-26 23:35:25.547 CEST [166098] postgres barman 10.x.x.101 STATEMENT:
SELECT pg_create_restore_point('barman_20190926T233501')
2019-09-26 23:35:26.201 CEST [97468] DEBUG: archived write-ahead log
file "000000010000000100000084"

OS:
CentOS Linux release 7.7.1908 (Core) on a virtual machine (on a HyperV host)
with 12 GB RAM
uname -a:
Linux dev_server 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux

We are using
PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623
(Red Hat 4.8.5-36), 64-bit

installed from repo
https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-$releasever-$basearch

Upgraded from PostgreSQL 10.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC)
4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
using pg_dumpall -> initdb -> psql load of dump (psql -f) -> vacuumdb -a -f
-z
initdb:
/usr/pgsql-11/bin/initdb --pgdata=/pg_data/data/i02mspdev1/
--waldir=/pg_data/wals/i02mspdev01/11 --pwprompt --encoding=UTF8
--locale=en_US.UTF-8 --data-checksums

barman -v
2.9
Barman by 2ndQuadrant (www.2ndQuadrant.com)
server config: path_prefix = "/usr/pgsql-11/bin"

--changed postgresql.conf settings:

1 "listen_addresses" "*"
2 "port" "5433"
3 "max_connections" "120"
4 "superuser_reserved_connections" "5"
5 "unix_socket_directories" "/var/run/modispostgresql, /tmp"
6 "tcp_keepalives_idle" "4"
7 "tcp_keepalives_interval" "4"
8 "tcp_keepalives_count" "2"
9 "shared_buffers" "2GB"
10 "temp_buffers" "512MB"
11 "work_mem" "16MB"
12 "maintenance_work_mem" "512MB"
13 "dynamic_shared_memory_type" "posix"
14 "effective_io_concurrency" "200"
15 "wal_level" "replica"
16 "wal_sync_method" "fdatasync"
17 "wal_compression" "on"
18 "wal_buffers" "16MB"
19 "max_wal_size" "2GB"
20 "min_wal_size" "1GB"
21 "checkpoint_completion_target" "0.9"
22 "archive_mode" "on"
23 "archive_command" "barman-wal-archive barman.server dev_server_5433 %p &&
test ! -f /pg_data/walarch/i02mspdev1/11/%f && cp %p
/pg_data/walarch/i02mspdev1/11/%f "
24 "wal_keep_segments" "4"
25 "track_commit_timestamp" "on"
26 "synchronous_standby_names" "barman_receive_wal"
27 "hot_standby" "off"
28 "random_page_cost" "1.0"
29 "min_parallel_table_scan_size" "4MB"
30 "from_collapse_limit" "12"
31 "join_collapse_limit" "16"
32 "log_destination" "stderr"
33 "logging_collector" "on"
34 "log_directory" "log"
35 "log_filename" "postgresql-%Y-%m-%d_%H%M%S.log"
36 "log_truncate_on_rotation" "on"
37 "log_rotation_age" "1d"
38 "log_rotation_size" "10MB"
39 "syslog_ident" "postgres_i02mspdev1"
40 "log_min_messages" "debug1"
41 "log_line_prefix" "%m [%p] %d %u %h "
42 "log_lock_waits" "on"
43 "log_statement" "ddl"
44 "log_replication_commands" "on"
45 "log_timezone" "Europe/Berlin"
46 "cluster_name" "i02mspdev1"
47 "track_functions" "pl"
48 "temp_tablespaces" "/pg_tempts/i02mspdev1"
49 "datestyle" "iso, mdy"
50 "timezone" "Europe/Berlin"
51 "lc_messages" "en_US.UTF-8"
52 "lc_monetary" "en_US.UTF-8"
53 "lc_numeric" "en_US.UTF-8"
54 "lc_time" "en_US.UTF-8"
55 "default_text_search_config" "pg_catalog.english"
56 "shared_preload_libraries" "plugin_debugger,pg_stat_statements"

kind regards
Georg

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Bruce Momjian 2019-09-27 16:47:46 Re: BUG #15991: Troubles about Management Tools for Postgre.
Previous Message Thomas Munro 2019-09-27 01:36:15 Re: BUG #16026: default_tablespace in postgresql.conf is used instead of the database's default.