From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-10-05 15:23:54
Message-ID: Yz2hivgyjS1RfMKs@depesz.com

Hi,
We are still in the process of preparing upgrades from Ubuntu Bionic with
Pg 12 to Focal with Pg 14.

And today (well, yesterday) we hit something that looks like a bug.

Jumping to the problem: WAL files needed by logical replication were
removed from the primary, despite the slot being there.

Now, a bit more detail.

Our replication is set up using the following method (a rough SQL sketch of
these steps follows the list):

1. We create all tables and primary keys on pg14.
2. A replication slot is created, with no tables.
3. We add some tables to the replication set, and wait for all tables to
   switch state to "ready".
4. Rinse and repeat until all tables are in the replication set.
5. Once all is ready, we add the rest of the schema (indexes, constraints,
   functions, fkeys, and so on).
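
To be explicit, here is roughly what that looks like in SQL. The connection
string and object names are illustrative, not our exact ones (only the
slot/subscription name "focal14" matches the logs below):

-- on pg12 (publisher): start with an empty publication
CREATE PUBLICATION focal14;

-- on pg14 (subscriber): this also creates the "focal14" slot on pg12
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=bionic-primary dbname=dbname'
    PUBLICATION focal14;

-- then, repeatedly:
-- on pg12:
ALTER PUBLICATION focal14 ADD TABLE schema_name.table_name;
-- on pg14:
ALTER SUBSCRIPTION focal14 REFRESH PUBLICATION;
-- on pg14, wait until every added table reports state 'r' (ready):
SELECT srrelid::regclass, srsubstate FROM pg_subscription_rel;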

This works. Worked?

Yesterday we saw something that broke the replication.

First, data from the Bionic side:
when | currentLSN | slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743431068 | CA09/BE23BC70 | CA09/C82B2A58
2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743451605 | CA09/C9DA89B0 | CA09/D3FA9898
2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743479469 | CA09/DD32CE50 | CA09/E65E7F80
2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743516374 | CA09/EDD61CB8 | CA09/F3A856D8
2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743635890 | CA0A/2157B8F8 | CA0A/2A4C48F8
2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743686305 | CA0A/38BE5AB0 | CA0A/3D2B0C20
2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 29207 | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:34:30 UTC | CA10/3B78E10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30

As you can see, around 2:20pm replication stopped fetching data
(restart_lsn and confirmed_flush_lsn stop advancing).
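
For reference, the table above comes from a monitoring query along these
lines (reconstructed from memory, so approximate):

SELECT now() AS "when",
       pg_current_wal_lsn() AS "currentLSN",
       s.*
FROM pg_replication_slots s
WHERE slot_name = 'focal14';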

At this time, the pg14 logs don't show any problems whatsoever. pg_restore
was doing its work creating indexes, along with some queries like:
ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;

At 2:20:48pm there was:
2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0

But these errors happened earlier too (minutes/hours earlier).

Then, 3 minutes later, I got:

2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0

The WAL file had been removed. Please note that the segment was, as shown earlier, still covered by "restart_lsn" as visible on pg12/Bionic.
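
To spell out the mapping: the slot's restart_lsn of CA0A/49ED91C8 falls
exactly into the segment that pg12 complains about, which can be checked
with pg_walfile_name():

SELECT pg_walfile_name('CA0A/49ED91C8'::pg_lsn);
-- returns 000000010000CA0A00000049 (timeline 1, 16MB segments)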

Unfortunately I don't know what the content of pg_wal on Bionic was at that
moment, but currently I see:

$ ls -l pg_wal/| head
total 297534552
-rw------- 1 postgres postgres 349 Feb 11 2022 000000010000A1410000001B.00C81918.backup
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004D
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004E
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004F
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000050
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000051
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000052
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000053
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000054

So, 4 files (000000010000CA0A00000049 through 000000010000CA0A0000004C) are missing.

These were archived properly, and I tried to restore them from the archive
and put them back in pg_wal, but even then pg12 rejected attempts to connect
to the focal14 slot with the same message about "requested WAL segment
000000010000CA0A00000049 has already been removed".

I know that at this point we're hosed. This replication was removed and
everything cleared (after all, we had already accumulated 300G of WAL, and
we can't extend the FS forever, so we needed to clean up the situation).

But the way I look at it, pg12 shouldn't have removed these 4 WAL files,
and the fact that it did means that either we did something wrong (what,
though?), or there is some kind of bug in Pg.
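
For what it's worth, a check along these lines (just a sketch; pg_ls_waldir()
requires superuser or pg_monitor membership) is what I'd now add to
monitoring, to catch a discrepancy between a slot's restart_lsn and what is
actually present in pg_wal:

SELECT slot_name,
       restart_lsn,
       pg_walfile_name(restart_lsn) AS needed_segment,
       EXISTS (SELECT 1
               FROM pg_ls_waldir() w
               WHERE w.name = pg_walfile_name(restart_lsn)) AS segment_in_pg_wal
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL;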

Does anyone have any idea how to investigate/fix this?

Best regards,

depesz
