RE: Disabled logical replication origin session causes primary key errors

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Shawn McCoy <shawn(dot)the(dot)mccoy(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, "drewwcallahan(at)gmail(dot)com" <drewwcallahan(at)gmail(dot)com>, "scott(at)meads(dot)us" <scott(at)meads(dot)us>
Subject: RE: Disabled logical replication origin session causes primary key errors
Date: 2025-04-23 01:41:20
Message-ID: OSCPR01MB149668436272E7ADE05202A1AF5BA2@OSCPR01MB14966.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

Dear Sawada-san,

Thanks for giving comments!

> +# The bug was that the replication origin wasn’t updated when
> +# apply_error_callback() was called with elevel >= ERROR, and the apply
> worker
> +# continued running afterward.
>
> I think it would be better to mention the fact that the problem
> happened when an error was caught for instance by a plpgsql function.
> How about rewriting it as follows?
>
> # The bug was that when an ERROR was caught, for instance by a
> PL/pgSQL function,
> # the apply worker reset the replication origin but continued processing
> # subsequent changes. This behavior resulted in a failure to update
> the replication
> # origin during further apply operations.

I tried to describe the internal cause of the bug, whereas your version
describes the observed behavior.
+1, replaced.

> ---
> +# Define an after-trigger function for the table insert. It can be fired even
> +# by the apply worker and always raises an exception. This situation allows
> +# worker continue after apply_error_callback() is called with elevel = ERROR.
> +$node_subscriber->safe_psql(
> + 'postgres', q{
> +CREATE FUNCTION handle_exception_trigger()
> +RETURNS TRIGGER AS $$
> +BEGIN
> + BEGIN
> + -- Raise an exception
> + RAISE EXCEPTION 'This is a test exception';
> + EXCEPTION
> + WHEN OTHERS THEN
> + RETURN NEW;
> + END;
> +
> + RETURN NEW;
> +END;
> +$$ LANGUAGE plpgsql;
> +});
> +
> +$node_subscriber->safe_psql(
> + 'postgres', q{
> +CREATE TRIGGER silent_exception_trigger
> +AFTER INSERT OR UPDATE ON t1
> +FOR EACH ROW
> +EXECUTE FUNCTION handle_exception_trigger();
> +
> +ALTER TABLE t1 ENABLE ALWAYS TRIGGER silent_exception_trigger;
> +});
>
> How about rewriting the comment as follows?
>
> # Create an AFTER INSERT trigger on the table that raises and subsequently
> # handles an exception. Subsequent insertions will trigger this exception,
> # causing the apply worker to invoke its error callback with an ERROR. However,
> # since the error is caught within the trigger, the apply worker will continue
> # processing changes.

Fixed.

> And can we execute these queries in one safe_psql() call?

Yes, that is possible. I had intentionally separated them for clarity, but I
have no strong opinion. Fixed.

> ---
> +# Obtain current remote_lsn value to check its advancement later
> +my $remote_lsn = $node_subscriber->safe_psql('postgres',
> + "SELECT remote_lsn FROM
> pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
> s WHERE os.external_id = 'pg_' || s.oid AND s.subname = 'regress_sub'"
> +);
>
> It seems to make sense to me to get the remote_lsn value just before
> executing INSERT after creating the trigger.

Moved.

> Is it a conventional way to always use schema-qualified catalogs names
> in regression tests? Looking at other tests in src/test/subscription,
> there are only three cases:
>
> % git grep pg_catalog src/test/subscription/t
> src/test/subscription/t/001_rep_changes.pl: FROM
> pg_catalog.pg_stat_io
> src/test/subscription/t/020_messages.pl: "SELECT COUNT(*) FROM
> pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND
> active='f'",
> src/test/subscription/t/029_on_error.pl: "SELECT subenabled = false
> FROM pg_catalog.pg_subscription WHERE subname = 'sub'"
>
> ISTM that we don't necessarily need to make the catalog name schema-qualified.

I referred to parts of Cluster.pm and 040_standby_failover_slots_sync.pl, which
use the "pg_catalog" prefix. On further consideration, the instances are created
within the test and we make sure to connect to them, so we can assume they are
safe. It is OK to remove the prefix.
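
For reference, the remote_lsn query quoted above would then read roughly as
follows. This is only a sketch based on the quoted hunk, with the schema
qualification dropped and line breaks added for readability; the exact text in
the attached patch may differ.

```sql
-- Sketch: same query as in the quoted hunk, without the pg_catalog prefix.
-- 'regress_sub' is the subscription name used in the quoted test.
SELECT remote_lsn
  FROM pg_replication_origin_status os, pg_subscription s
 WHERE os.external_id = 'pg_' || s.oid
   AND s.subname = 'regress_sub';
```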

> ---
> We might want to stop both the publisher and the subscriber at the end
> of the tests.

Oops, added.

The attached patches pass the tests in my environment, and pgperltidy reported
no problems.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment                                      Content-Type              Size
v6-PG16-PG17-0001-Fix-oversight-3f28b2f.patch   application/octet-stream  5.4 KB
v6-HEAD-0001-Fix-oversight-3f28b2f.patch        application/octet-stream  6.6 KB
