Re: Strange decreasing value of pg_last_wal_receive_lsn()

From: godjan • <g0dj4n(at)gmail(dot)com>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Cc: Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: Strange decreasing value of pg_last_wal_receive_lsn()
Date: 2020-05-14 02:18:33
Message-ID: D3A6D0DE-A8C7-4E3A-A1B6-406C53662928@gmail.com
Lists: pgsql-hackers

> Why do you kill -9 your standby?
Hi, it’s a Jepsen test for our HA solution. It checks that we don’t lose data in such a situation.

So we have now updated our logic as Michael suggested. All live HA standbys now wait until they have replayed all the WAL they have, and then we use pg_last_wal_replay_lsn() to choose which standby to promote during failover.
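
A minimal sketch of that selection step, in Python. It assumes each live standby has already reported the text result of `SELECT pg_last_wal_replay_lsn()`; the standby names and the helper names `parse_lsn` and `pick_promotion_candidate` are hypothetical and only for illustration. The one point it tries to show is that LSNs should be compared as 64-bit numbers, not as strings.

```python
from typing import Dict


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/3000150' into a 64-bit integer.

    The part before the slash is the high 32 bits, the part after it
    the low 32 bits, both written in hexadecimal.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(replay_lsns: Dict[str, str]) -> str:
    """Return the name of the standby with the highest replayed LSN.

    replay_lsns maps a (hypothetical) standby name to the text value
    it reported for pg_last_wal_replay_lsn().
    """
    if not replay_lsns:
        raise ValueError("no live standby reported an LSN")
    return max(replay_lsns, key=lambda name: parse_lsn(replay_lsns[name]))
```

A plain string comparison would rank '0/A000000' above '0/10000000', which is why the values are parsed first. Ties are not handled specially here; `max` simply returns the first standby with the highest value.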

That fixed our problem, but there is another one. We now have to wait until all live HA hosts finish replaying WAL before we can fail over. That can take a while (for example, when the WAL contains a record for a b-tree split).

We are looking for a way to find the standby that has all the data, and to replay all WAL on that standby only before failover.

Maybe you have ideas on how to preserve the latest actual value of pg_last_wal_receive_lsn()? As I understand it, the WAL receiver doesn’t write walrcv->flushedUpto to disk.
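
To make the problem concrete, here is a toy Python model of the walreceiver logic quoted further down. It is only a sketch under stated assumptions: the class and function names are made up and are not PostgreSQL's, the segment size is taken to be the default 16 MB, streaming is assumed to be requested from the start of the segment, and a kill -9 of the standby is modeled as the shared-memory state starting again from zeros.

```python
from dataclasses import dataclass

# Assumption: default wal_segment_size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


@dataclass
class WalRcvState:
    """Toy stand-in for the shared-memory fields in the quoted snippet."""
    receive_start: int = 0
    received_tli: int = 0
    flushed_upto: int = 0


def request_streaming(state: WalRcvState, tli: int, recptr: int) -> None:
    """Model of a streaming request: start at the segment boundary."""
    recptr -= recptr % WAL_SEGMENT_SIZE
    # First startup on this timeline: flushed_upto is (re)initialized.
    if state.receive_start == 0 or state.received_tli != tli:
        state.flushed_upto = recptr
        state.received_tli = tli
    state.receive_start = recptr


def flush(state: WalRcvState, upto: int) -> None:
    """Model of the walreceiver flushing received WAL up to `upto`."""
    state.flushed_upto = max(state.flushed_upto, upto)


def simulate_kill_and_restart(tli: int, recptr: int) -> WalRcvState:
    """Shared memory is gone after kill -9, so the state starts empty."""
    state = WalRcvState()
    request_streaming(state, tli, recptr)
    return state
```

In this model, a standby that had flushed up to 0x3000150 reports 0x3000000 after `simulate_kill_and_restart(1, 0x3000150)`, which matches the decrease we observe. A walreceiver restart that keeps the same `WalRcvState` does not lose the value.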

> On 13 May 2020, at 19:52, Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com> wrote:
>
>
> (too bad the history has been removed to keep context)
>
> On Fri, 8 May 2020 15:02:26 +0500
> godjan • <g0dj4n(at)gmail(dot)com> wrote:
>
>> I got it, thank you.
>> Can you recommend what to use to determine which quorum standby should be
>> promoted in such case? We planned to use pg_last_wal_receive_lsn() to
>> determine which has fresh data but if it returns the beginning of the segment
>> on both replicas we can’t determine which standby confirmed that write
>> transaction to disk.
>
> Wait, pg_last_wal_receive_lsn() only decreases because you killed your standby.
>
> pg_last_wal_receive_lsn() returns the value of walrcv->flushedUpto. The latter
> is set to the beginning of the requested segment only during the first
> walreceiver startup or on a timeline fork:
>
>     /*
>      * If this is the first startup of walreceiver (on this timeline),
>      * initialize flushedUpto and latestChunkStart to the starting point.
>      */
>     if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
>     {
>         walrcv->flushedUpto = recptr;
>         walrcv->receivedTLI = tli;
>         walrcv->latestChunkStart = recptr;
>     }
>     walrcv->receiveStart = recptr;
>     walrcv->receiveStartTLI = tli;
>
> After a primary loss, as long as the standbys are up and running, it is fine
> to use pg_last_wal_receive_lsn().
>
> Why do you kill -9 your standby? What am I missing? Could you explain the
> use case you are working on to justify this?
>
> Regards,
