Race condition in FetchTableStates() breaks synchronization of subscription tables

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Race condition in FetchTableStates() breaks synchronization of subscription tables
Date: 2024-01-26 06:00:00
Message-ID: 711a6afe-edb7-1211-cc27-1bef8239eec7@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

After determining a possible cause for intermittent failures of the test
subscription/031_column_list [1], I was wondering what makes another
subscription test (014_binary) fail on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39

All those failures caused by a timeout when waiting for a message expected
in _subscriber.log. For example, in the snakefly's case:
[01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at
/home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

_subscriber.log contains:
2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-12-21 01:17:46.756 UTC [409999] ERROR:  could not receive data from WAL stream: server closed the connection
unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical replication apply worker" (PID 409999) exited with
exit code 1
2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply worker for subscription "tsub" has started
...

While _subscriber.log from a successful test run contains:
2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: session time: 0:00:00.014 user=postgres
database=postgres host=[local]
2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply worker for subscription "tsub" has started
2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table synchronization worker for subscription "tsub",
table "test_mismatching_types" has started
2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data format
2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY test_mismatching_types, line 1, column a

In this case, "logical replication apply worker for subscription "tsub" has
started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later.

I've managed to reproduce this failure locally by running multiple tests in
parallel, and my analysis shows that it is caused by a race condition when
accessing variable table_states_valid inside tablesync.c.

tablesync.c does the following with table_states_valid:
/*
 * Callback from syscache invalidation.
 */
void
invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
{
    table_states_valid = false;
}
...
static bool
FetchTableStates(bool *started_tx)
{
...
    if (!table_states_valid)
    {
...
        /* Fetch all non-ready tables. */
        rstates = GetSubscriptionRelations(MySubscription->oid, true);
...
        table_states_valid = true;
    }

So, when syscache invalidation occurs inside the code block
"if (!table_states_valid)", that invalidation is effectively ignored.

In a failed case I observe the following events:
1. logical replication apply worker performs
 LogicalRepApplyLoop() -> process_syncing_tables() ->
 process_syncing_tables_for_apply() -> FetchTableStates() periodically.

2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
 for SUBSCRIPTIONRELMAP, and that leads to calling
 invalidate_syncing_table_states().

3. If the apply worker manages to fetch no non-ready tables in
 FetchTableStates() and ignore "table_states_valid = false" from
 invalidate_syncing_table_states(), then it just misses the invalidation
 event, so it keeps working without noticing non-ready tables appeared as
 the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips
 a loop "foreach(lc, table_states_not_ready) ..." until some other event
 occurs).

pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my
analysis -- without it, I need tens of iterations (with 50 tests running in
parallel) to catch the failure, but with it, I get the failure on the first
iteration.

(Reproduced on REL_16_STABLE..master, where the test 014_binary tries to
"Refresh the publication to trigger the tablesync", but I think the race
condition exists in the previous versions as well.)

[1] https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com

Best regards,
Alexander

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2024-01-26 06:02:54 Re: A performance issue with Memoize
Previous Message Hayato Kuroda (Fujitsu) 2024-01-26 05:51:49 RE: speed up a logical replica setup