Re: Strange decreasing value of pg_last_wal_receive_lsn()

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: godjan • <g0dj4n(at)gmail(dot)com>
Cc: Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: Strange decreasing value of pg_last_wal_receive_lsn()
Date: 2020-06-02 14:11:15
Message-ID: 20200602161115.53928f48@firost
Lists: pgsql-hackers

On Mon, 1 Jun 2020 12:44:26 +0500
godjan • <g0dj4n(at)gmail(dot)com> wrote:

> Hi, sorry for the two-week delay in answering :)
>
> >> It fixed our trouble, but there is another one. Now we have to wait until
> >> all alive HA hosts finish replaying WAL before failing over. It might take
> >> a while (for example, when the WAL contains a record for a b-tree split).
> >
> > Indeed, this is the concern I wrote about yesterday in a second mail on this
> > thread.
>
> Actually, I found out that we use the wrong heuristic to detect that a
> standby is still replaying WAL. We compare the values of
> pg_last_wal_replay_lsn() before and after sleeping. If the standby is
> replaying a huge WAL record (e.g. a b-tree split), it gives us the wrong
> result.

It could, yes.
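
To make the failure mode concrete, here is a minimal sketch of that
before/after comparison. The function name and the read_replay_lsn callable
are hypothetical; in practice the callable would run
SELECT pg_last_wal_replay_lsn() on the standby.

```python
import time
from typing import Callable


def replay_seems_stalled(read_replay_lsn: Callable[[], str],
                         pause_s: float = 1.0) -> bool:
    """Return True when the replay LSN did not move during the pause.

    read_replay_lsn is a hypothetical callable that returns the text form
    of pg_last_wal_replay_lsn(), e.g. "0/1D019F68".
    """
    before = read_replay_lsn()
    time.sleep(pause_s)
    after = read_replay_lsn()
    # Weak point: the replay LSN only advances once a record has been
    # fully replayed, so a standby that is busy with one long record
    # reports the same value twice and looks idle.
    return before == after
```

An unchanged value therefore only says that no record finished replaying
during the pause, not that there is nothing left to replay.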

> > Note that when you promote a node, it first replays available WALs before
> > acting as a primary.
>
> Do you know how Postgres determines that a standby is still replaying
> available WAL? I couldn't work it out from the promotion code.

See chapter "26.2.2. Standby Server Operation" in the official documentation:

«
Standby mode is exited and the server switches to normal operation when
pg_ctl promote is run or a trigger file is found (promote_trigger_file).
Before failover, any WAL immediately available in the archive or in pg_wal
will be restored, but no attempt is made to connect to the master.
»

In the source code, dig around the following chain if interested: StartupXLOG ->
ReadRecord -> XLogReadRecord -> XLogPageRead -> WaitForWALToBecomeAvailable.

[...]

> > Nope, no clean and elegant idea. Once your instances are killed, maybe you
> > can force flush the system cache (to secure in-memory-only data)?
>
> Does "force flush the system cache" mean invoking this command,
> https://linux.die.net/man/8/sync, on the standby?

Yes, just for safety.

> > and read the latest received WAL using pg_waldump?
>
> I did an experiment with pg_waldump without sync:
> - write data on primary
> - kill primary
> - read the latest received WAL using pg_waldump:
> 0/1D019F38
> - pg_last_wal_replay_lsn():
> 0/1D019F68

Normal. pg_waldump gives you the starting LSN of the record.
pg_last_wal_replay_lsn() returns lastReplayedEndRecPtr, which is the end of the
record:

/*
* lastReplayedEndRecPtr points to end+1 of the last record successfully
* replayed.

So I suppose your last xlogrecord was 0x30 (48) bytes long, since
0/1D019F68 - 0/1D019F38 = 0x30. If I remember correctly, minimal xlogrecord
length is 24 bytes, so I bet there's only one xlogrecord there, starting at
0/1D019F38 with last byte at 0/1D019F67.

> So it’s wrong to use pg_waldump to find out what the latest received LSN
> was. At least without “forcing flush system cache”.

Nope, just add the xlogrecord length to its starting LSN.
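
A rough sketch of that arithmetic, in case it helps. The helper names are
hypothetical, and I assume 8-byte MAXALIGN (typical on 64-bit builds) and a
record that does not cross a WAL page boundary, where page headers would add
extra bytes:

```python
def parse_lsn(text: str) -> int:
    """Convert the "hi/lo" hexadecimal text form of an LSN to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert an integer LSN back to its "hi/lo" text form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def record_end(start_lsn: str, total_len: int, maxalign: int = 8) -> str:
    """Return end+1 of a record, rounded up to the next aligned position.

    Assumptions: maxalign is 8 and the record stays within one WAL page.
    """
    end = parse_lsn(start_lsn) + total_len
    end = (end + maxalign - 1) & ~(maxalign - 1)
    return format_lsn(end)


# Gap between the pg_waldump start LSN and the replay LSN from your test.
gap = parse_lsn("0/1D019F68") - parse_lsn("0/1D019F38")
print(gap)  # 48
```

With the values from your experiment, record_end("0/1D019F38", 48) gives
0/1D019F68, which is what pg_last_wal_replay_lsn() returned.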
