Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

From: vignesh C <vignesh21(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Date: 2024-02-01 04:59:14
Message-ID: CALDaNm0hP4J6fbmsNDWBTbb8GxcM=YkB=tjdJbWFTocB8u4MQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 30 Jan 2024 at 17:22, Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Tuesday, January 30, 2024 11:21 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> > wrote:
> > >
> > > On Monday, January 29, 2024 9:22 PM vignesh C <vignesh21(at)gmail(dot)com>
> > wrote:
> > > >
> > > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion(at)gmail(dot)com>
> > wrote:
> > > > >
> > > > > Hello hackers,
> > > > >
> > > > > After determining a possible cause for intermittent failures of
> > > > > the test subscription/031_column_list [1], I was wondering what
> > > > > makes another subscription test (014_binary) fail on the buildfarm:
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&d
> > > > > t=20
> > > > > 24-01-22%2001%3A19%3A03
> > > > >
> > > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2
> > > > 02
> > > > > 4-01-14%2018%3A19%3A20
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt
> > > > > =202
> > > > > 3-12-21%2001%3A11%3A52
> > > > >
> > > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2
> > > > 02
> > > > > 3-11-27%2001%3A42%3A39
> > > > >
> > > > > All those failures caused by a timeout when waiting for a message
> > > > > expected in _subscriber.log. For example, in the snakefly's case:
> > > > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
> > > > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
> > > > > incorrect binary data format) at
> > > > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_bin
> > > > ary.pl
> > > > line 269.
> > > > >
> > > > > _subscriber.log contains:
> > > > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement:
> > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > > 2023-12-21 01:17:46.756 UTC [409999] ERROR: could not receive
> > > > > data from WAL stream: server closed the connection unexpectedly
> > > > > This probably means the server terminated abnormally
> > > > > before or while processing the request.
> > > > > 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker
> > > > > "logical replication apply worker" (PID 409999) exited with exit
> > > > > code 1
> > > > > 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication
> > > > > apply worker for subscription "tsub" has started ...
> > > > >
> > > > > While _subscriber.log from a successful test run contains:
> > > > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement:
> > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
> > > > > session time: 0:00:00.014 user=postgres database=postgres
> > > > > host=[local]
> > > > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication
> > > > > apply worker for subscription "tsub" has started
> > > > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication
> > > > > table synchronization worker for subscription "tsub", table
> > > > > "test_mismatching_types" has started
> > > > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data
> > > > > format
> > > > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY
> > > > > test_mismatching_types, line 1, column a
> > > > >
> > > > > In this case, "logical replication apply worker for subscription
> > > > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3
> > > > > minutes
> > > > later.
> > > > >
> > > > > I've managed to reproduce this failure locally by running multiple
> > > > > tests in parallel, and my analysis shows that it is caused by a
> > > > > race condition when accessing variable table_states_valid inside
> > tablesync.c.
> > > > >
> > > > > tablesync.c does the following with table_states_valid:
> > > > > /*
> > > > > * Callback from syscache invalidation.
> > > > > */
> > > > > void
> > > > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32
> > > > > hashvalue) {
> > > > > table_states_valid = false;
> > > > > }
> > > > > ...
> > > > > static bool
> > > > > FetchTableStates(bool *started_tx) { ...
> > > > > if (!table_states_valid)
> > > > > {
> > > > > ...
> > > > > /* Fetch all non-ready tables. */
> > > > > rstates = GetSubscriptionRelations(MySubscription->oid,
> > > > > true); ...
> > > > > table_states_valid = true;
> > > > > }
> > > > >
> > > > > So, when syscache invalidation occurs inside the code block "if
> > > > > (!table_states_valid)", that invalidation is effectively ignored.
> > > > >
> > > > > In a failed case I observe the following events:
> > > > > 1. logical replication apply worker performs
> > > > > LogicalRepApplyLoop() -> process_syncing_tables() ->
> > > > > process_syncing_tables_for_apply() -> FetchTableStates() periodically.
> > > > >
> > > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates
> > syscache
> > > > > for SUBSCRIPTIONRELMAP, and that leads to calling
> > > > > invalidate_syncing_table_states().
> > > > >
> > > > > 3. If the apply worker manages to fetch no non-ready tables in
> > > > > FetchTableStates() and ignore "table_states_valid = false" from
> > > > > invalidate_syncing_table_states(), then it just misses the invalidation
> > > > > event, so it keeps working without noticing non-ready tables
> > appeared as
> > > > > the result of ALTER SUBSCRIPTION
> > > > > (process_syncing_tables_for_apply()
> > > > skips
> > > > > a loop "foreach(lc, table_states_not_ready) ..." until some other event
> > > > > occurs).
> > > > >
> > > > > pg_usleep(100000) added just below GetSubscriptionRelations(...)
> > > > > proves my analysis -- without it, I need tens of iterations (with
> > > > > 50 tests running in
> > > > > parallel) to catch the failure, but with it, I get the failure on
> > > > > the first iteration.
> > > >
> > > > Thanks for the analysis, I was able to reproduce the issue with the
> > > > steps you had shared. I agree with your analysis. I added some logs
> > > > to verify that the invalidation was getting missed.
> > > >
> > > > I felt that this invalidation is getting ignored because we have
> > > > used a boolean variable here, how about changing it slightly so that
> > > > table_states_invalid gets incremented for every invalidation and
> > > > then decrementing table_states_invalid after getting the non-ready
> > > > tables like in the attached patch. I was able to verify that the test passes
> > with the attached patch.
> > >
> > > Thanks for the patch.
> > >
> > > I am not sure if counting the invalidation number is needed, as even
> > > if there are hundreds of invalidations outside of FetchTableStates,
> > > one FetchTableStates call should reset the count to 0 as it is checking the
> > latest catalog.
> >
> > Another approach I was thinking of is to reset table_states_valid immediately in
> > the beginning of FetchTableStates, so any new invalidations will take care of
> > setting table_states_valid again it again which will be handled in the next
> > iteration of fetching non-ready tables like in the attached patch.
>
> I think this is not the standard appraoch for cache building. Because if any
> ERROR happens during the cache building, then we will come into the situation
> that table_states_valid=true while the cache data is invalid. Even if we
> currently don't access these cache after erroring out, but I think we'd better
> to avoid this risk.

Here is an updated patch which changes the boolean variable to a
tri-state enum and set stable state to valid only if no invalidations
have been occurred while the list is being prepared.

Regards,
Vignesh

Attachment Content-Type Size
v3-0001-Table-sync-missed-for-newly-added-tables-because-.patch text/x-patch 3.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-02-01 05:00:00 Re: POC, WIP: OR-clause support for indexes
Previous Message Yugo NAGATA 2024-02-01 04:45:24 Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value