Re: Strange decreasing value of pg_last_wal_receive_lsn()

From: godjan • <g0dj4n(at)gmail(dot)com>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Cc: Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: Strange decreasing value of pg_last_wal_receive_lsn()
Date: 2020-06-01 07:44:26
Message-ID: 6E9C1C03-F7A5-49D2-ABE9-98D12C4804AF@gmail.com

Hi, sorry for the two-week latency in answering :)

>> It fixed our trouble, but there is another one. Now we have to wait for all
>> alive HA hosts to finish replaying WAL before failing over. It might take a
>> while (for example, when the WAL contains a record about splitting a b-tree).
>
> Indeed, this is the concern I wrote about yesterday in a second mail on this
> thread.

Actually, I found out that we use the wrong heuristic to detect whether a standby is still replaying WAL.
We compare the values of pg_last_wal_replay_lsn() before and after sleeping.
If the standby is replaying a huge WAL record (e.g. a b-tree split), this gives us the wrong result.
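
For clarity, the check is roughly the following (a simplified shell sketch, not our actual tooling; the host name and the one-second pause are placeholders):

    # naive "has the standby finished replaying?" check:
    # compare pg_last_wal_replay_lsn() before and after a short pause
    before=$(psql -h standby1 -Atc "SELECT pg_last_wal_replay_lsn()")
    sleep 1
    after=$(psql -h standby1 -Atc "SELECT pg_last_wal_replay_lsn()")
    if [ "$before" = "$after" ]; then
        # wrong conclusion if the standby is busy replaying one huge record,
        # e.g. a b-tree split: the LSN does not move, but replay is not finished
        echo "replay finished on standby1"
    fi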

> Note that when you promote a node, it first replays available WALs before
> acting as a primary.

Do you know how Postgres determines that a standby is still replaying available WAL?
I couldn't find it in the promotion code.

> However, how useful is this test in a context of auto failover for
> **service** high availability?

Such a situation has a really low probability in our service, and we thought it would be acceptable to resolve it with on-call participation.

> Nope, no clean and elegant idea. Once your instances are killed, maybe you can
> force flush the system cache (secure in-memory-only data)?

Do "force flush the system cache” means invoke this command https://linux.die.net/man/8/sync <https://linux.die.net/man/8/sync> on the standby?

> and read the latest received WAL using pg_waldump?

I did an experiment with pg_waldump without sync:
- write data on primary
- kill primary
- read the latest received WAL using pg_waldump:
0/1D019F38
- pg_last_wal_replay_lsn():
0/1D019F68

So it's wrong to use pg_waldump to determine the latest received LSN, at least without "forcing a flush of the system cache".
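
For reference, the way I read the "latest received" LSN looked roughly like this (a sketch of the kind of invocation I mean; the data directory path is a placeholder, not necessarily the exact command from the experiment):

    # take the newest segment in pg_wal and look at the last record pg_waldump can decode
    seg=$(ls /var/lib/postgresql/data/pg_wal/ | grep -E '^[0-9A-F]{24}$' | sort | tail -n 1)
    pg_waldump /var/lib/postgresql/data/pg_wal/$seg 2>/dev/null | tail -n 1
    # -> ... lsn: 0/1D019F38 ...   (behind pg_last_wal_replay_lsn() = 0/1D019F68)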

> If all your nodes are killed in the same
> disaster, how/why an automatic cluster manager should take care of starting all
> nodes again and pick the right node to promote?

1. How?
The automatic cluster manager will restart the standbys in such a situation.
If the primary's lock in ZK is released, the automatic cluster manager starts the election of a new primary.
To decide which node should be promoted, the automatic cluster manager has to get the LSN of the last WAL record written to disk by each potential new primary.
We used pg_last_wal_receive_lsn() for this, but that was a mistake, because after "kill -9" on a standby pg_last_wal_receive_lsn() reports the first LSN of the last segment.
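
To illustrate, the election step is roughly this (a simplified sketch; the host list is a placeholder and the real manager coordinates through ZK rather than calling psql directly):

    # ask every alive standby which LSN it reports and promote the most advanced one
    best_host=""
    best_lsn="0/0"
    for host in standby1 standby2 standby3; do
        lsn=$(psql -h "$host" -Atc "SELECT pg_last_wal_receive_lsn()")
        # after kill -9 and restart this reports the first LSN of the last segment,
        # which can make a less advanced standby look like the best candidate
        newer=$(psql -h "$host" -Atc "SELECT '$lsn'::pg_lsn > '$best_lsn'::pg_lsn")
        if [ "$newer" = "t" ]; then
            best_lsn=$lsn
            best_host=$host
        fi
    done
    echo "promote $best_host (last LSN $best_lsn)"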

2. Why?
- a sleepy on-call in the middle of the night can do something bad in such a situation
- pg_waldump doesn't give the last LSN written to disk (at least without forcing a flush of the system cache), so I don't know how the on-call can figure out which standby should be promoted
- the automatic cluster manager has successfully resolved such situations in clusters with one determined synchronous standby for years, and we hope it's possible to do the same in clusters with quorum replication
