Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

From: japin <japinli(at)hotmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION
Date: 2021-01-13 09:05:51
Message-ID: MEYP282MB1669087B084BDE0C1BD6C525B6A90@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On Wed, 13 Jan 2021 at 16:49, Bharath Rupireddy wrote:
> On Wed, Jan 13, 2021 at 11:27 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Wed, Jan 13, 2021 at 11:08 AM Bharath Rupireddy
>> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>> >
>> > On Wed, Jan 13, 2021 at 10:33 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> > >
>> > > On Tue, Jan 12, 2021 at 5:23 PM japin <japinli(at)hotmail(dot)com> wrote:
>> > > >
>> > > > On Tue, 12 Jan 2021 at 19:32, Bharath Rupireddy wrote:
>> > > > > On Tue, Jan 12, 2021 at 4:47 PM Li Japin <japinli(at)hotmail(dot)com> wrote:
>> > > > >> IIUC the logical replication only replicate the tables in publication, I think
>> > > > >> when the tables that aren't in publication should not be replicated.
>> > > > >>
>> > > > >> Attached the patch that fixes it. Thought?
>> > > > >
>> > > > > With that change, we don't get the behaviour that's stated in the
>> > > > > document - "The ADD TABLE and DROP TABLE clauses will add and remove
>> > > > > one or more tables from the publication. Note that adding tables to a
>> > > > > publication that is already subscribed to will require a ALTER
>> > > > > SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side in
>> > > > > order to become effective" -
>> > > > > https://www.postgresql.org/docs/devel/sql-alterpublication.html.
>> > > > >
>> > > >
>> > > > The documentation only emphasize adding tables to a publication, not
>> > > > include dropping tables from a publication.
>> > > >
>> > >
>> > > Right and I think that is ensured by the subscriber by calling
>> > > should_apply_changes_for_rel() which won't return true unless the
>> > > newly added relation is not synced via Refresh Publication. So, this
>> > > means with or without this patch we should be sending the changes of
>> > > the newly published table from the publisher?
>> >
>> > Oh, my bad, alter subscription...refresh publication is required only when the tables are added to the publisher. Patch by japin makes the walsender process to stop sending the data to the subscriber/apply worker. The patch is based on the idea of looking at the PUBLICATIONRELMAP in get_rel_sync_entry when the entries have been invalidated in rel_sync_cache_publication_cb because of alter publication...drop table.
>> > ,
>> > When the alter subscription...refresh publication is run on the subscriber, the SUBSCRIPTIONRELMAP catalogue gets invalidated but the corresponding cache entries in the LogicalRepRelMap which is used by logicalrep_rel_open are not invalidated. LogicalRepRelMap is used to know the relations that are associated with the subscription. But we seem to have not taken care of invalidating those entries, though we have the invalidation callback invalidate_syncing_table_states registered for SUBSCRIPTIONRELMAP in ApplyWorkerMain. So, we miss to work on updating the entries in LogicalRepRelMap.
>> >
>> > IMO, the ideal way to fix this issue is 1) stop the walsender sending the changes to dropped tables, for this japin patch works 2) we must mark all the LogicalRepRelMap entries as invalid in invalidate_syncing_table_states so that in the next logicalrep_rel_open, if the entry is invalidated, then we have to call GetSubscriptionRelState to get the latest state, as shown in [1]. Likewise, we might also have to mark the cache entries invalid in subscription_change_cb which is invalidation callback for pg_subscription
>> >
>>
>> Is the second point you described here is related to the original
>> bug-reported or will it cause some other problem?
>
> It can cause some other problems. I found this while investigating
> from the subscriber perspective why the subscriber is accepting the
> inserts even though the relation is removed from the
> pg_subscription_rel catalogue after refresh publication. I ended up
> finding that the cache entries are not being invalidated in
> invalidate_syncing_table_states.
>
> Having said that, the patch proposed by japin is enough to solve the
> bug reported here.
>
> While we are fixing the bug, I thought it's better to fix this as
> well, maybe as a 0002 patch? If okay, I can work on the patch and post
> it in a separate thread?
>
>> > Thoughts?
>> >
>> > [1] -
>> > if (entry->state != SUBREL_STATE_READY || entry->invalid)
>> > entry->state = GetSubscriptionRelState(MySubscription->oid,
>> > entry->localreloid,
>> > &entry->statelsn);
>> >
>> > if (entry->invalid)
>> > entry->invalid = false;
>> >
>> > return entry;
>> >
>> > > I have another question on your patch which is why in some cases like
>> > > when we have not inserted in step-5 (as mentioned by you) the
>> > > following insertions are not sent. Is somehow we are setting the
>> > > pubactions as false in that case, if so, how?
>> >
>> > The reason is that the issue reported in this thread occurs - when we have the walsender process running, RelationSyncCache is initialized, we inserted some data into the table that's sent to the subscriber and the table is dropped, we miss to set the pubactions to false in get_rel_sync_entry, though the cache entries have been invalidated.
>> >
>> > In some cases, it works properly because the old walsender process was stopped and when a new walsender is started, then the cache RelationSyncCache gets initialized again and the pubactions will be set to false in get_rel_sync_entry.
>> >
>>
>> Why is walsender process was getting stopped in one case but not in another?
>
> Both walsender and logical replication apply workers are getting
> stopped because of the default values of 60s for wal_sender_timeout
> and wal_receiver_timeout respectively. To analyze further, I increased
> both timeouts to 600s.
>
> Here's what happening:
>
> Note that for the sake of simplicity, I'm skipping the create
> publication, subscription, initial insertion statements, I'm starting
> from alter publication ... drop table statement.
>
> case 1 - issue reported in this thread is observed:
> 1) on publisher - alter publication ... drop table t1; and insert into
> t1 values(67); Though the table is dropped from the publication, the
> walsender process publishes the inserts, which will be fixed by the
> patch proposed by japin.
>
> 2) on subscriber - on the insert into t1(1) on the publisher,
> logicalrep_relmap_update gets called in the subscriber because the
> publisher informs the subscriber that it has dropped the table from
> the publication. In logicalrep_relmap_update, the LogicalRepRelMap
> cache entry corresponding to the dropped relation is reset. Since the
> alter subscription ... refresh publication is not yet run, the insert
> is taken to the target table and the LogicalRepRelMap cache entry for
> that table is updated in logicalrep_rel_open, it's state is read from
> the catalogues using GetSubscriptionRelState, because of the entry
> reset in logicalrep_relmap_update, entry->state is
> SUBREL_STATE_UNKNOWN. After GetSubscriptionRelState, entry->state
> becomes SUBREL_STATE_READY
> if (entry->state != SUBREL_STATE_READY)
> entry->state = GetSubscriptionRelState(MySubscription->oid,
> entry->localreloid,
> &entry->statelsn);
>
> 3) on subscriber - run alter subscription ... refresh publication, the
> dropped relation entry is removed from the pg_subscription_rel
> catalogue and invalidate_syncing_table_states gets called. But note
> that we have not invalidated the LogicalRepRelMap. The
> LogicalRepRelMap cache entry for the table remains what we have set in
> (2) i.e. entry->state is SUBREL_STATE_READY.
>
> 4) on publisher - insert into t1 values(68); as mentioned in (1),
> walsender process keeps sending the inserts.
>
> 5) on subscriber - apply_handle_insert is called and the
> logicalrep_rel_open will not call GetSubscriptionRelState to get the
> new catalogue entry for pg_subscription_rel (3), because the
> entry->state is still SUBREL_STATE_READY which was set (2), so it ends
> up in applying the incoming inserts, until the GetSubscriptionRelState
> is called for that entry which happens either the apply worker stops
> and restarts due to timeout or another change to the publication
> happens in the publisher so that logicalrep_relmap_update is called on
> the subscriber. If we had marked all the LogicalRepRelMap cache
> entries as invalid in the invalidate_syncing_table_states callback as
> pointed out by me in the earlier mail, then this problem will be
> solved.
>
> case 2 - issue is not observed:
> 1) on publisher - alter publication ... drop table t1; no inserts into
> the table t1, subscriber will not receive logicalrep_relmap_update
> yet.
>
> 2) on subscriber - run alter subscription ... refresh publication, the
> dropped relation entry is removed from the pg_subscription_rel
> catalogue and invalidate_syncing_table_states gets called. But note
> that we have not invalidated the LogicalRepRelMap. The
> LogicalRepRelMap cache entry for the table remains the same.
>
> 3) on publisher - insert into t1 values(67); Though the table is
> dropped from the publication (1), the walsender process publishes the
> inserts, which will be fixed by the patch proposed by japin.
>
> 4) on subscriber - on the insert into t1(3) on the publisher,
> logicalrep_relmap_update gets called in the subscriber because the
> publisher informs the subscriber that it has dropped the table from
> the publication. In logicalrep_relmap_update, the LogicalRepRelMap
> cache entry corresponding to the dropped relation is reset, because of
> which the next logicalrep_rel_open will call GetSubscriptionRelState
> (as entry->state is SUBREL_STATE_UNKNOWN due to the reset in
> logicalrep_relmap_update). GetSubscriptionRelState will return
> SUBREL_STATE_UNKNOWN, because it doesn't find the tuple in the updated
> pg_subscription_rel catalogue because of the alter subscription ...
> refresh publication would have removed tuple related to the table from
> the pg_subscription_rel. So, the inserts are ignored because the
> should_apply_changes_for_rel returns false after the
> logicalrep_rel_open in apply_handle_insert.
> /* Try finding the mapping. */
> tup = SearchSysCache2(SUBSCRIPTIONRELMAP,
> ObjectIdGetDatum(relid),
> ObjectIdGetDatum(subid));
>
> if (!HeapTupleIsValid(tup))
> {
> *sublsn = InvalidXLogRecPtr;
> return SUBREL_STATE_UNKNOWN;
> }
>
> 5) Here on, all further inserts into the publication table are ignored
> by the subscriber because of the same reason (4). So, the issue
> reported in this thread doesn't occur.
>
> I'm sorry for the huge write up. I hope I clarified the point this time.
>

Yes! Very clearly.

> In summary, I feel we need to fix the publisher sending the inserts
> even though the table is dropped from the publication, that is the
> patch patch proposed by japin. This solves the bug reported in this
> thread.
>
> And also, it's good to have the LogicalRepRelMap invalidation fix as a
> 0002 patch in invalidate_syncing_table_states, subscription_change_cb
> and logicalrep_rel_open as proposed by me.
>
> Thoughts?
>

I think invalidate the LogicalRepRelMap is necessary. If the table isn't in
subscription, can we remove the LogicalRepRelMapEntry from LogicalRepRelMap?

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2021-01-13 09:15:05 Re: POC: postgres_fdw insert batching
Previous Message Amit Kapila 2021-01-13 08:50:06 Re: Single transaction in the tablesync worker?