Re: Skipping logical replication transactions on subscriber side

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Greg Nancarrow <gregn4422(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Alexey Lesovsky <lesovsky(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Skipping logical replication transactions on subscriber side
Date: 2021-11-17 06:52:05
Message-ID: CALDaNm1kDoB8TigAHUJ01s17AhU5DznBfNK2ywCedi95DvNFgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 16, 2021 at 12:01 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Nov 15, 2021 at 11:43 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 15, 2021 at 2:48 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Nov 15, 2021 at 4:49 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Nov 15, 2021 at 1:49 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > I've attached an updated patch that incorporates all comments I got so
> > > > > far. Please review it.
> > > > >
> > > >
> > > > Thanks for the updated patch.
> > > > A few minor comments:
> > > >
> > > > doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > > >
> > > > (1) tab in doc updates
> > > >
> > > > There's a tab before "Otherwise,":
> > > >
> > > > + copy of the relation with <parameter>relid</parameter>.
> > > > Otherwise,
> > >
> > > Fixed.
> > >
> > > >
> > > > src/backend/utils/adt/pgstatfuncs.c
> > > >
> > > > (2) The function comment for "pg_stat_reset_subscription_worker_sub"
> > > > seems a bit long and I expected it to be multi-line (did you run
> > > > pg_indent?)
> > >
> > > I ran pg_indent on pgstatfuncs.c but it didn't become a multi-line comment.
> > >
> > > >
> > > > src/include/pgstat.h
> > > >
> > > > (3) Remove PgStat_StatSubWorkerEntry.dbid?
> > > >
> > > > The "dbid" member of the new PgStat_StatSubWorkerEntry struct doesn't
> > > > seem to be used, so I think it should be removed.
> > > > (I could remove it and everything builds OK and tests pass).
> > > >
> > >
> > > Fixed.
> > >
> > > Thank you for the comments! I've updated an updated version patch.
> >
> > Thanks for the updated patch.
> > I found one issue:
> > This Assert can fail in few cases:
> > +void
> > +pgstat_report_subworker_error(Oid subid, Oid subrelid, Oid relid,
> > +
> > LogicalRepMsgType command, TransactionId xid,
> > + const char *errmsg)
> > +{
> > + PgStat_MsgSubWorkerError msg;
> > + int len;
> > +
> > + Assert(strlen(errmsg) < PGSTAT_SUBWORKERERROR_MSGLEN);
> > + len = offsetof(PgStat_MsgSubWorkerError, m_message[0]) +
> > strlen(errmsg) + 1;
> > +
> >
> > I could reproduce the problem with the following scenario:
> > Publisher:
> > create table t1 (c1 varchar);
> > create publication pub1 for table t1;
> > insert into t1 values(repeat('abcd', 5000));
> >
> > Subscriber:
> > create table t1(c1 smallint);
> > create subscription sub1 connection 'dbname=postgres port=5432'
> > publication pub1 with ( two_phase = true);
> > postgres=# select * from pg_stat_subscription_workers;
> > WARNING: terminating connection because of crash of another server process
> > DETAIL: The postmaster has commanded this server process to roll back
> > the current transaction and exit, because another server process
> > exited abnormally and possibly corrupted shared memory.
> > HINT: In a moment you should be able to reconnect to the database and
> > repeat your command.
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Failed.
> >
> > Subscriber logs:
> > 2021-11-15 19:27:56.380 IST [15685] LOG: logical replication apply
> > worker for subscription "sub1" has started
> > 2021-11-15 19:27:56.384 IST [15687] LOG: logical replication table
> > synchronization worker for subscription "sub1", table "t1" has started
> > TRAP: FailedAssertion("strlen(errmsg) < PGSTAT_SUBWORKERERROR_MSGLEN",
> > File: "pgstat.c", Line: 1946, PID: 15687)
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (ExceptionalCondition+0xd0)[0x55a18f3c727f]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (pgstat_report_subworker_error+0x7a)[0x55a18f126417]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (ApplyWorkerMain+0x493)[0x55a18f176611]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (StartBackgroundWorker+0x23c)[0x55a18f11f7e2]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (+0x54efc0)[0x55a18f134fc0]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (+0x54f3af)[0x55a18f1353af]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (+0x54e338)[0x55a18f134338]
> > /lib/x86_64-linux-gnu/libpthread.so.0(+0x141f0)[0x7feef84371f0]
> > /lib/x86_64-linux-gnu/libc.so.6(__select+0x57)[0x7feef81e3ac7]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (+0x5498c2)[0x55a18f12f8c2]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (PostmasterMain+0x134c)[0x55a18f12f1dd]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (+0x43c3d4)[0x55a18f0223d4]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xd5)[0x7feef80fd565]
> > postgres: logical replication worker for subscription 16387 sync 16384
> > (_start+0x2e)[0x55a18ecaf4fe]
> > 2021-11-15 19:27:56.483 IST [15645] LOG: background worker "logical
> > replication worker" (PID 15687) was terminated by signal 6: Aborted
> > 2021-11-15 19:27:56.483 IST [15645] LOG: terminating any other active
> > server processes
> > 2021-11-15 19:27:56.485 IST [15645] LOG: all server processes
> > terminated; reinitializing
> >
> > Here it fails because of a long error message ""invalid input syntax
> > for type smallint:
>
> Good catch!
>
> > \"abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabc...."
> > because we try to insert varchar type data into smallint type. Maybe
> > we should trim the error message in this case.
>
> Right. I've fixed this issue and attached an updated patch.

Thanks for the updated patch. The issue is fixed in the patch provided.
I found that in one of the scenarios the statistics is getting lost:
Test steps:
Step 1:
Setup Publisher(create 100 publications pub1...pub100 for t1...t100) like below:
===============================================
create table t1(c1 int);
create publication pub1 for table t1;
insert into t1 values(10);
insert into t1 values(10);
create table t2(c1 int);
create publication pub1 for table t2;
insert into t2 values(10);
insert into t2 values(10);
....

Script can be generated using:
while [ $a -lt 100 ]
do
a=`expr $a + 1`
echo "./psql -d postgres -p 5432 -c \"create table t$a(c1
int);\"" >> publisher.sh
echo "./psql -d postgres -p 5432 -c \"create publication pub$a
for table t$a;\"" >> publisher.sh
echo "./psql -d postgres -p 5432 -c \"insert into t$a
values(10);\"" >> publisher.sh
echo "./psql -d postgres -p 5432 -c \"insert into t$a
values(10);\"" >> publisher.sh
done

Step 2:
Setup Subscriber(create 100 subscriptions):
===============================================
create table t1(c1 int primary key);
create subscription sub1 connection 'dbname=postgres port=5432'
publication pub1;
create table t2(c1 int primary key);
create subscription sub2 connection 'dbname=postgres port=5432'
publication pub2;
....

Script can be generated using:
while [ $a -lt 100]
do
a=`expr $a + 1`
echo "./psql -d postgres -p 5433 -c \"create table t$a(c1 int
primary key);\"" >> subscriber.sh
echo "./psql -d postgres -p 5433 -c \"create subscription
sub$a connection 'dbname=postgres port=5432' publication pub$a;\"" >>
subscriber.sh
done

Step 3:
postgres=# select * from pg_stat_subscription_workers order by subid;
subid | subname | subrelid | relid | command | xid | error_count |
error_message | first_error_time | last_error_time
-------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------+----------------------------------+----------------------------------
16389 | sub1 | 16384 | 16384 | | | 17 | duplicate key value violates
unique constraint "t1_pkey" | 2021-11-17 12:01:46.141086+05:30 |
2021-11-17 12:03:13.175698+05:30
16395 | sub2 | 16390 | 16390 | | | 16 | duplicate key value violates
unique constraint "t2_pkey" | 2021-11-17 12:01:51.337055+05:30 |
2021-11-17 12:03:15.512249+05:30
16401 | sub3 | 16396 | 16396 | | | 16 | duplicate key value violates
unique constraint "t3_pkey" | 2021-11-17 12:01:51.352157+05:30 |
2021-11-17 12:03:15.802225+05:30
16407 | sub4 | 16402 | 16402 | | | 16 | duplicate key value violates
unique constraint "t4_pkey" | 2021-11-17 12:01:51.390638+05:30 |
2021-11-17 12:03:14.709496+05:30
16413 | sub5 | 16408 | 16408 | | | 16 | duplicate key value violates
unique constraint "t5_pkey" | 2021-11-17 12:01:51.418825+05:30 |
2021-11-17 12:03:15.257235+05:30

Step 4:
Then restart the publisher

Step 5:
postgres=# select * from pg_stat_subscription_workers order by subid;
subid | subname | subrelid | relid | command | xid | error_count |
error_message |
first_error_time | last_error_time
-------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-----
-----------------------------+----------------------------------
16389 | sub1 | 16384 | 16384 | | | 1 | could not create replication
slot "pg_16389_sync_16384_7031422794938304519": FATAL: terminating
connection due to administrator command+| 2021
-11-17 12:03:28.201247+05:30 | 2021-11-17 12:03:28.201247+05:30
| | | | | | | server closed the connection unexpectedly +|
|
| | | | | | | This probably means the server terminated abnormally +|
|
| | | | | | | before or while proce |
|
16395 | sub2 | 16390 | 16390 | | | 18 | duplicate key value violates
unique constraint "t2_pkey" | 2021
-11-17 12:01:51.337055+05:30 | 2021-11-17 12:03:23.832585+05:30
16401 | sub3 | 16396 | 16396 | | | 18 | duplicate key value violates
unique constraint "t3_pkey" | 2021
-11-17 12:01:51.352157+05:30 | 2021-11-17 12:03:26.567873+05:30
16407 | sub4 | 16402 | 16402 | | | 1 | could not create replication
slot "pg_16407_sync_16402_7031422794938304519": FATAL: terminating
connection due to administrator command+| 2021
-11-17 12:03:28.196958+05:30 | 2021-11-17 12:03:28.196958+05:30
| | | | | | | server closed the connection unexpectedly +|
|
| | | | | | | This probably means the server terminated abnormally +|
|
| | | | | | | before or while proce |
|
16413 | sub5 | 16408 | 16408 | | | 18 | duplicate key value violates
unique constraint "t5_pkey" | 2021
-11-17 12:01:51.418825+05:30 | 2021-11-17 12:03:25.595697+05:30

Step 6:
postgres=# select * from pg_stat_subscription_workers order by subid;
subid | subname | subrelid | relid | command | xid | error_count |
error_message | first_error_time | last_error_time
-------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------+----------------------------------+----------------------------------
16389 | sub1 | 16384 | 16384 | | | 1 | duplicate key value violates
unique constraint "t1_pkey" | 2021-11-17 12:03:33.346514+05:30 |
2021-11-17 12:03:33.346514+05:30
16395 | sub2 | 16390 | 16390 | | | 19 | duplicate key value violates
unique constraint "t2_pkey" | 2021-11-17 12:01:51.337055+05:30 |
2021-11-17 12:03:33.437505+05:30
16401 | sub3 | 16396 | 16396 | | | 19 | duplicate key value violates
unique constraint "t3_pkey" | 2021-11-17 12:01:51.352157+05:30 |
2021-11-17 12:03:33.482954+05:30
16407 | sub4 | 16402 | 16402 | | | 1 | duplicate key value violates
unique constraint "t4_pkey" | 2021-11-17 12:03:33.327489+05:30 |
2021-11-17 12:03:33.327489+05:30
16413 | sub5 | 16408 | 16408 | | | 19 | duplicate key value violates
unique constraint "t5_pkey" | 2021-11-17 12:01:51.418825+05:30 |
2021-11-17 12:03:33.374522+05:30

We can see that sub1 and sub4 statistics are lost, old error_count
value is lost. I'm not sure if this behavior is ok or not. Thoughts?

Regards,
Vignesh

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-11-17 07:01:19 Re: pg_upgrade test for binary compatibility of core data types
Previous Message Pavel Stehule 2021-11-17 06:32:31 Re: Schema variables - new implementation for Postgres 15