Re: Minimal logical decoding on standbys

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2019-04-03 14:27:33
Message-ID: CAJ3gD9dnEn6fKXCwdCZF6HGC15oMKaAqEas6_v_psYk0HCZwmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2 Apr 2019 at 21:34, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
> > On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> > > I managed to get a recovery conflict by :
> > > 1. Setting hot_standby_feedback to off
> > > 2. Creating a logical replication slot on standby
> > > 3. Creating a table on master, and insert some data.
> > > 2. Running : VACUUM FULL;
> > >
> > > This gives WARNING messages in the standby log file.
> > > 2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
> > > catalog xmin 474 conflicts with removed xid 477
> > > 2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
> > > for Heap2/CLEAN: remxid 477
> > >
> > > But I did not add such a testcase into the test file, because with the
> > > current patch, it does not do anything with the slot; it just keeps on
> > > emitting WARNING in the log file; so we can't test this scenario as of
> > > now using the tap test.
> >
> > I am going ahead with drop-the-slot way of handling the recovery
> > conflict. I am trying out using ReplicationSlotDropPtr() to drop the
> > slot. It seems the required locks are already in place inside the for
> > loop of ResolveRecoveryConflictWithSlots(), so we can directly call
> > ReplicationSlotDropPtr() when the slot xmin conflict is found.
>
> Cool.
>
>
> > As explained above, the only way I could reproduce the conflict is by
> > turning hot_standby_feedback off on slave, creating and inserting into
> > a table on master and then running VACUUM FULL. But after doing this,
> > I am not able to verify whether the slot is dropped, because on slave,
> > any simple psql command thereon, waits on a lock acquired on sys
> > catache, e.g. pg_authid. Working on it.
>
> I think that indicates a bug somewhere. If replay progressed, it should
> have killed the slot, and continued replaying past the VACUUM
> FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
> compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
> at a backtrace of the startup process.

Oops, it was my own change that caused the hang. Sorry for the noise.
After using wal_debug, found out that after replaying the LOCK records
for the catalog pg_auth, it was not releasing it because it had
actually got stuck in ReplicationSlotDropPtr() itself. In
ResolveRecoveryConflictWithSlots(), a shared
ReplicationSlotControlLock was already held before iterating through
the slots, and now ReplicationSlotDropPtr() again tries to take the
same lock in exclusive mode for setting slot->in_use, leading to a
deadlock. I fixed that by releasing the shared lock before calling
ReplicationSlotDropPtr(), and then re-starting the slots' scan over
again since we released it. We do similar thing for
ReplicationSlotCleanup().

Attached is a rebased version of your patch
logical-decoding-on-standby.patch. This v2 version also has the above
changes. It also includes the tap test file which is still in WIP
state, mainly because I have yet to add the conflict recovery handling
scenarios.

I see that you have already committed the
move-latestRemovedXid-computation-for-nbtree-xlog related changes.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
logical-decoding-on-standby_v2.patch application/octet-stream 38.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2019-04-03 14:43:33 Re: [PATCH v20] GSSAPI encryption support
Previous Message Robert Haas 2019-04-03 14:26:22 Re: minimizing pg_stat_statements performance overhead