Segfault logical replication PG 10.4

From: Mai Peng <maily(dot)peng(at)webedia-group(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: maxence(at)bothorel(dot)net
Subject: Segfault logical replication PG 10.4
Date: 2018-07-11 22:56:34
Message-ID: 4EB4BD78-BFC3-4D04-B8DA-D53DF7160354@webedia-group.com
Lists: pgsql-hackers

We discovered that our pg_wal partition was full a few days after setting up our first logical publication on a PG 10.4 instance.
After that, we could not synchronise our slave with the master: every attempt triggered a segfault on the slave. We had to manually drop the subscription on the slave and the slot on the master.
Then, to find the cause of this bug, we stopped the connection between master and slave; after 30 minutes, the slave segfaulted and could not synchronise.
Why can the slave not synchronise without a complete re-creation of the subscription after the slot has been dropped?
How should we manage replication, given that we use cloud VMs and experience network latency issues?

Here are the details of the configuration and the error logs:
Conf on master:
max_replication_slots = 10
max_sync_workers_per_subscription = 2
wal_receiver_timeout = 60s
wal_keep_segments = 1000
wal_receiver_status_interval = 10
wal_retrieve_retry_interval = 5s
max_logical_replication_workers = 4
Conf on slave:
same except wal_keep_segments = 0
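
For context on the full pg_wal partition, here is a rough sketch of how much WAL the wal_keep_segments setting alone keeps on disk. It assumes the default 16 MB WAL segment size (in PG 10 this can only be changed at build time); the helper name wal_keep_bytes is made up for illustration. Note that a replication slot whose subscriber is not consuming holds back WAL in addition to this, with no upper limit in PG 10, so this figure is only a floor.

```python
# Assumption: default WAL segment size of 16 MB (not stated in the report).
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def wal_keep_bytes(wal_keep_segments: int,
                   segment_bytes: int = WAL_SEGMENT_BYTES) -> int:
    """Bytes of WAL retained in pg_wal because of wal_keep_segments alone.

    Hypothetical helper; it ignores WAL held back by replication slots
    and by checkpoint-related settings.
    """
    return wal_keep_segments * segment_bytes


if __name__ == "__main__":
    for host, segments in (("master", 1000), ("slave", 0)):
        gib = wal_keep_bytes(segments) / 1024 ** 3
        print(f"{host}: wal_keep_segments={segments} -> {gib:.1f} GiB")
```

With the values above, the master keeps roughly 15.6 GiB of WAL for wal_keep_segments = 1000 before anything retained by the slot is counted.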

Error log on slave:
LOG: logical replication apply worker for subscription "XXXX" has started
DEBUG: connecting to publisher using connection string "postgresql://USER(at)IP"
LOG: worker process: logical replication worker for subscription 132253 (PID 25359) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
DEBUG: unregistering background worker "logical replication worker for subscription 132253"
LOG: database system was interrupted; last known up at 2018-07-11 21:50:56 UTC
DEBUG: checkpoint record is at 0/7DBFEF10
DEBUG: redo record is at 0/7DBFEF10; shutdown TRUE
DEBUG: next transaction ID: 0:93714; next OID: 140237
DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
DEBUG: oldest unfrozen transaction ID: 548, in database 1
DEBUG: oldest MultiXactId: 1, in database 1
DEBUG: commit timestamp Xid oldest/newest: 0/0
DEBUG: transaction ID wrap limit is 2147484195, limited by database with OID 1
DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
DEBUG: starting up replication slots
LOG: recovered replication state of node 2 to 0/0
LOG: recovered replication state of node 3 to 0/0
LOG: recovered replication state of node 4 to 0/0
LOG: recovered replication state of node 5 to 56A5/29ACA918
LOG: database system was not properly shut down; automatic recovery in progress
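
The positions in the log above are LSNs, printed as two hexadecimal halves (high 32 bits / low 32 bits). A minimal sketch for turning them into plain byte positions follows; lsn_to_int is a hypothetical helper name. As far as we understand, the "replication state of node N" values are positions in the master's WAL while the checkpoint record is a position in the slave's own WAL, so the two are not directly comparable.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN in 'HI/LO' hex notation to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


if __name__ == "__main__":
    # Values copied from the slave's log above.
    for label, lsn in (("checkpoint record", "0/7DBFEF10"),
                       ("replication state of node 5", "56A5/29ACA918")):
        print(f"{label}: {lsn} = byte {lsn_to_int(lsn)}")
```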

THANK YOU
