Re: Synchronizing slots from primary to standby

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-16 07:42:31
Message-ID: CAA4eK1Jun8SGCoc6JEktxY_+L7GmoJWrdsx-KCEP=GL-SsWggQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> Thanks for noticing this. I have pushed all your debug patches. Let's
> hope if there is a BF failure next time, we can gather enough
> information to know the reason of the same.
>

There is a new BF failure [1] after adding these LOGs and I think I
know what is going wrong. First, let's look at standby LOGs:

2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG:
segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4
available
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG:
xmin required by slots: data 0, catalog 741
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:mote
could not sync slot information as reslot precedes local slot: remote
slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN
(0/4000168), catalog xmin (741)

So, from the above LOG, it is clear that the remote slot's catalog
xmin (739) precedes the local catalog xmin (741) which makes the sync
on standby to not complete.

Next, let's look at the LOG from the primary during the nearby time:
2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG:
analyzing "pg_catalog.pg_depend"
2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG:
"pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0
dead rows; 1709 rows in sample, 1709 estimated total rows
...
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG:
Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes,
cost_limit=200, cost_delay=2 active=yes failsafe=no)
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG:
analyzing "information_schema.sql_features"
2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG:
"sql_features": scanned 8 of 8 pages, containing 756 live rows and 0
dead rows; 756 rows in sample, 756 estimated total rows

It shows us that autovacuum worker has analyzed catalog table and for
updating its statistics in pg_statistic table, it would have acquired
a new transaction id. Now, after the slot creation, a new transaction
id that has updated the catalog is generated on primary and would have
been replication to standby. Due to this catalog_xmin of primary's
slot would precede standby's catalog_xmin and we see this failure.

As per this theory, we should disable autovacuum on primary to avoid
updates to catalog_xmin values.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-02-16%2006%3A12%3A59

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2024-02-16 08:00:41 Re: Synchronizing slots from primary to standby
Previous Message Bharath Rupireddy 2024-02-16 07:38:45 Re: Improve WALRead() to suck data directly from WAL buffers when possible