Re: DROP TABLE can crash the replication sync worker

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: DROP TABLE can crash the replication sync worker
Date: 2021-02-04 11:33:37
Message-ID: CAA4eK1JJ0=rL+LegmgKC3OyUbd3TLrNw7oY-Xq5fmCmR1JAkxg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 4, 2021 at 5:31 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Wed, Feb 3, 2021 at 11:49 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > >
> > > Hi Hackers.
> > >
> > > As discovered in another thread [master] there is an *existing* bug in
> > > the PG HEAD code which can happen if a DROP TABLE is done at same time
> > > a replication tablesync worker is running.
> > >
> > > It seems the table's relid that the sync worker is using gets ripped
> > > out from underneath it and is invalidated by the DROP TABLE. Any
> > > subsequent use of that relid will go wrong.
> > >
> >
> > Where exactly did you pause the tablesync worker while dropping the
> > table? We acquire the lock on the table in LogicalRepSyncTableStart
> > and then keep it for the entire duration of tablesync worker so drop
> > table shouldn't be allowed.
> >
>
> I have a breakpoint set on LogicalRepSyncTableStart. The DROP TABLE is
> done while paused on that breakpoint, so no code of
> LogicalRepSyncTableStart has even executed yet.
>

Fair enough. So, you are hitting this problem in finish_sync_worker()
while logging the message because by that time the relation is
dropped. I think it is good to fix that but we don't want the patch
you have attached here, we can fix it locally in finish_sync_worker()
by constructing a different message (something like: "logical
replication table synchronization worker for subscription \"%s\" has
finished") when we can't get rel_name from rel id. This doesn't appear
to be as serious a problem as we were talking about in the patch
"Allow multiple xacts during table sync in logical replication." [1]
because there we don't hold the lock on the table for the entire
duration tablesync. So, even if we want to fix this problem, it would
be more appropriate for back-branches if we push the patch [1].

[1] - https://www.postgresql.org/message-id/CAHut%2BPtAKP1FoHbUEWN%2Ba%3D8Pg_njsJKc9Zoz05A_ewJSvjX2MQ%40mail.gmail.com

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Nancarrow 2021-02-04 12:05:17 Re: Parallel INSERT (INTO ... SELECT ...)
Previous Message Bharath Rupireddy 2021-02-04 11:28:46 Re: Is Recovery actually paused?