Re: Logical replication stops dropping used initial-sync replication slots

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Logical replication stops dropping used initial-sync replication slots
Date: 2022-03-25 08:38:56
Message-ID: 20220325083856.GA26455@depesz.com
Lists: pgsql-bugs

On Tue, Mar 22, 2022 at 12:26:46PM +0100, hubert depesz lubaczewski wrote:
> On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > > That wouldn't do me any good. The server I'm trying to upgrade now is
> > > test bench to make sure that I can upgrade using logical replication *at
> > > all*.
> > > Once I will have verified that I can, I have couple hundred servers to
> > > upgrade, that will have concurrent access all the time.
> > Is it possible to get a reproducible test or a script for this problem?
>
> Well, given that we're talking about 38GB database with 60k tables
> i find that unlikely.
>
> But I'll try if I can figure out a way to replicate the problem with
> something that is shareable.

OK. I don't think it will be simple/possible to replicate.

Reason:

I set up 10 separate environments, all with the same database, on
bionic with Pg 12.9.

Then I created a publication with all tables in specific schemas, and
subscribed to it from separate focal Pg 14.2 servers.
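
For reference, the setup was roughly along these lines; schema, table
and connection details are placeholders (the real publication listed
all tables from the relevant schemas, roughly 60k of them, since Pg 12
has no schema-level publications), and the subscription is named after
the publication here to match the slot name shown later:

-- on the source (bionic, Pg 12.9); placeholder table list:
CREATE PUBLICATION focal14 FOR TABLE some_schema.some_table, some_schema.other_table;

-- on each subscriber (focal, Pg 14.2); placeholder connection string:
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=source-host dbname=dbname user=some_user'
    PUBLICATION focal14;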

In total, I used 20 servers to run the test 10 times.

Effect: in one case replication didn't finish initial sync.

Currently it's 2022-03-25 08:32:14.10081+00, and on the one source
where the subscription had failed I see:
                  slot_name                  |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn
---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
 focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    161822728 | 8E9/3E70BB58 | 8E9/3E70BB58
 pg_3548429_sync_3343043_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32385 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3079712_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32378 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3145162_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32386 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_2993718_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32387 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3355696_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32379 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3099460_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32380 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3162651_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32381 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3157454_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32382 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3084198_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32383 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3142573_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32384 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
(11 rows)
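
For clarity, that listing is simply the pg_replication_slots view on
the source:

select * from pg_replication_slots;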

pg_stat_replication shows 10 sync connections, and that's it. Their
sent_lsn, write_lsn, flush_lsn and replay_lsn seem to be progressing
nicely, but otherwise nothing is changing.
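
(That check amounts to watching something like:

select application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
  from pg_stat_replication;

on the source.)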

pg_stat_activity contains:

=# select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
          xact_start           │          query_start          │         backend_type         │ query
═══════════════════════════════╪═══════════════════════════════╪══════════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 [null]                        │ [null]                        │ autovacuum launcher          │
 [null]                        │ 2022-03-24 20:45:30.757582+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_2993718_7077521990523695096" LOGICAL 8E9/40FF0170 (proto_version '1', publication_names '"focal14"')
 [null]                        │ [null]                        │ pglogical supervisor         │
 [null]                        │ [null]                        │ logical replication launcher │
 [null]                        │ 2022-03-24 20:45:30.730818+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3343043_7077521990523695096" LOGICAL 8E9/40FF0100 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.65015+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3079712_7077521990523695096" LOGICAL 8E9/40FEFF60 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.733366+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3145162_7077521990523695096" LOGICAL 8E9/40FF0138 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.76285+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3355696_7077521990523695096" LOGICAL 8E9/40FEFF98 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.663299+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3099460_7077521990523695096" LOGICAL 8E9/40FEFFD0 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.676412+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3162651_7077521990523695096" LOGICAL 8E9/40FF0020 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.691807+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3157454_7077521990523695096" LOGICAL 8E9/40FF0058 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.69548+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3084198_7077521990523695096" LOGICAL 8E9/40FF0090 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.974833+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3142573_7077521990523695096" LOGICAL 8E9/40FF00C8 (proto_version '1', publication_names '"focal14"')
 2022-03-25 08:35:27.971161+00 │ 2022-03-25 08:35:27.971161+00 │ client backend               │ select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
 [null]                        │ [null]                        │ background writer            │
 [null]                        │ [null]                        │ checkpointer                 │
 [null]                        │ [null]                        │ walwriter                    │
(17 rows)

So it looks to me like there is some random-ish condition that causes
the problem.

I tried to insulate the Pgs from the world as well as I could: no
monitoring, no cron, no external access. Just replication and my simple
load generator, running this loop:

CREATE TABLE depesz_test as SELECT i, repeat('payload', 50) as payload FROM generate_series(1,100000) i;
CREATE INDEX depesz_test_idx on depesz_test(i);
DROP TABLE depesz_test;

every minute.

I'm out of ideas. The good-ish news is that when I tried replication
with smaller sets (30k tables, not 60k), and/or with
max_sync_workers_per_subscription set to 2 instead of 10, it never
failed. So I guess I'll go that way.
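
For reference, that limit is a subscriber-side postgresql.conf
parameter, so something along these lines (plus a reload) should be
enough to change it:

-- on the subscriber:
alter system set max_sync_workers_per_subscription = 2;
select pg_reload_conf();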

Best regards,

depesz
