make async slave to wait for lsn to be replayed

From: Ivan Kartyshov <i(dot)kartyshov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: make async slave to wait for lsn to be replayed
Date: 2016-08-31 14:16:13
Message-ID: 0240c26c-9f84-30ea-fca9-93ab2df5f305@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

Few days earlier I've finished my work on WAITLSN statement utility, so
I’d like to share it.

Introduction
============

Our clients who deal with 9.5 and use asynchronous master-slave
replication, asked to make the wait-mechanism on the slave side to
prevent the situation when slave handles query which needs data (LSN)
that was received, flushed, but still not replayed.

Problem description
===================

The implementation:
Must handle the wait-mechanism using pg_sleep() in order not to load system
Must avoid race conditions if different backend want to wait for
different LSN
Must not take snapshot of DB, to avoid troubles with sudden minXID change
Must have optional timeout parameter if LSN traffic has stalled.
Must release on postmaster’s death or interrupts.

Implementation
==============

To avoid troubles with snapshots, WAITLSN was implemented as a utility
statement, this allows us to circumvent the snapshot-taking mechanism.
We tried different variants and the most effective way was to use Latches.
To handle interprocess interaction all Latches are stored in shared
memory and to cope with race conditions, each Latch is protected by a
Spinlock.
Timeout was made optional parameter, it is set in milliseconds.

What works
==========

Actually, it works well even with significant timeout or wait period
values, but of course there might be things I've overlooked.

How to use it
==========

WAITLSN ‘LSN’ [, timeout in ms];

#Wait until LSN 0/303EC60 will be replayed, or 10 second passed.
WAITLSN ‘0/303EC60’, 10000;

#Or same without timeout.
WAITLSN ‘0/303EC60’;

Notice: WAITLSN will release on PostmasterDeath or Interruption events
if they come earlier then LSN or timeout.

Testing the implementation
======================

The implementation was tested with testgres and unittest python modules.

How to test this implementation:
Start master server
Make table test, insert tuple 1
Make asynchronous slave replication (9.5 wal_level = standby, 9.6 or
higher wal_level = replica)
Slave: START TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
SELECT * FROM test;
Master: delete tuple + make vacuum + get new LSN
Slave: WAITLSN ‘newLSN’, 60000;
Waitlsn finished with FALSE “LSN doesn`t reached”
Slave: COMMIT;
WAITLSN ‘newLSN’, 60000;
Waitlsn finished with success (without NOTICE message)

The WAITLSN as expected wait LSN, and interrupts on PostmasterDeath,
interrupts or timeout.

Your feedback is welcome!

---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
waitlsn_10dev.patch text/x-patch 16.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2016-08-31 14:35:03 Re: pg_sequence catalog
Previous Message Marko Tiikkaja 2016-08-31 14:12:14 Re: INSERT .. SET syntax