Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: depesz(at)depesz(dot)com
Cc: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-10-16 05:05:17
Message-ID: CAA4eK1JTE0tLgR5qzVECrKJV5VVTnDa0k5LQ4TOkBAA=MySj3g@mail.gmail.com
Lists: pgsql-bugs

On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
<depesz(at)depesz(dot)com> wrote:
>
> Hi,
> We are still in the process of preparing upgrades from Ubuntu Bionic with Pg
> 12 to Focal with Pg 14.
>
> And today (well, yesterday) we hit something that looks like a bug.
>
> Jumping to the problem: WAL files needed by logical replication were
> removed from the primary, despite the slot being there.
>
> Now, a bit more details.
>
> Our replication is set up using the following method (see the sketch just
> after this list):
>
> 1. We create all tables and primary keys on pg14
> 2. a replication slot is created, with no tables
> 3. we add some tables to the replication set, and wait for all tables to
>    switch state to "ready"
> 4. rinse and repeat until all tables are in the replication set
> 5. once all is ready, we add the rest of the schema (indexes, constraints,
>    functions, fkeys, and so on).
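
For reference, a minimal sketch of what steps 1-4 could look like with native
logical replication (the plugin is pgoutput). The publication name and
connection string below are assumptions; only the subscription/slot name
"focal14" and the placeholder schema_name.table_name appear in the report:

    -- On the pg12 primary: start with an empty publication (step 2).
    CREATE PUBLICATION focal14_pub;

    -- On the pg14 target: creating the subscription also creates the
    -- replication slot "focal14" on the primary (the default slot_name).
    CREATE SUBSCRIPTION focal14
        CONNECTION 'host=bionic-primary dbname=dbname'
        PUBLICATION focal14_pub;

    -- Step 3, repeated per batch of tables:
    ALTER PUBLICATION focal14_pub ADD TABLE schema_name.table_name;  -- on the primary
    ALTER SUBSCRIPTION focal14 REFRESH PUBLICATION;                  -- on the target

    -- Wait until every table reaches the "ready" state ('r') on the target.
    SELECT srrelid::regclass, srsubstate
      FROM pg_subscription_rel
     WHERE srsubstate <> 'r';
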
>
> This works. Worked?
>
> Yesterday we saw something that broke the replication.
>
> First, data from the Bionic side:
> when | currentLSN | slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
> 2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743431068 | CA09/BE23BC70 | CA09/C82B2A58
> 2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743451605 | CA09/C9DA89B0 | CA09/D3FA9898
> 2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743479469 | CA09/DD32CE50 | CA09/E65E7F80
> 2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743516374 | CA09/EDD61CB8 | CA09/F3A856D8
> 2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743635890 | CA0A/2157B8F8 | CA0A/2A4C48F8
> 2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743686305 | CA0A/38BE5AB0 | CA0A/3D2B0C20
> 2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
> 2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
> 2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 29207 | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:34:30 UTC | CA10/3B78E10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
>
> As you can see, around 2:20pm replication stopped fetching data.
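
The columns above match pg_replication_slots on the pg12 primary plus the
current WAL position, so the sampling query was presumably something along
these lines (a guess, not taken from the report):

    SELECT now() AS "when",
           pg_current_wal_lsn() AS "currentLSN",
           s.*
      FROM pg_replication_slots s
     WHERE s.slot_name = 'focal14';
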
>

Around that time, it seems the slot was also marked inactive, probably
because the walsender exited.

> At this time, the pg14 logs don't show any problems whatsoever.
> pg_restore was doing its work with CREATE INDEX statements, and some queries
> like:
> ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;
>
> At 2:20:48pm there was:
> 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
>
> But these errors happened earlier too (minutes/hours earlier).
>
> But then, 3 minutes later I got:
>
> 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.",,,,,,,,,"","logical replication worker",,0
> 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
> 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
> 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
> 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
>
>
> The WAL file has been removed. Please note that, as shown earlier, the file was still covered by "restart_lsn" as visible on pg12/bionic.
>

This is quite strange and I am not able to see a reason why this could
happen. The only way to debug this problem that comes to mind is to add
more logging around the code that removes the WAL files. For example, we
can try to print the value of the minimum slot LSN (keep) and logSegNo in
KeepLogSeg().
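
That requires a server-side code change, so it can only help going forward.
As a complementary check from the SQL level (not the in-core logging described
above, just a way to watch the same boundary), one could poll the pg12 primary,
assuming a role that is allowed to call pg_ls_waldir():

    SELECT s.slot_name,
           s.restart_lsn,
           pg_walfile_name(s.restart_lsn) AS oldest_needed_segment,
           EXISTS (SELECT 1
                     FROM pg_ls_waldir() w
                    WHERE w.name = pg_walfile_name(s.restart_lsn)) AS still_in_pg_wal
      FROM pg_replication_slots s
     WHERE s.slot_name = 'focal14';

If still_in_pg_wal ever turns false while the slot is still needed, that is
the moment worth correlating with the server log.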

> Unfortunately I don't know what the contents of pg_wal on bionic were at
> that moment, but currently I see:
>
> $ ls -l pg_wal/| head
> total 297534552
> -rw------- 1 postgres postgres 349 Feb 11 2022 000000010000A1410000001B.00C81918.backup
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004D
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004E
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004F
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000050
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000051
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000052
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000053
> -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000054
>
> So, 4 files are missing.
>
> These were archived properly, so I tried to restore them from the archive
> and put them back in pg_wal, but even then pg12 rejected attempts to connect
> to the focal14 slot with the same message about "requested WAL segment
> 000000010000CA0A00000049 has already been removed".
>

I think you are seeing this behavior because we update lastRemovedSegNo
before actually removing the files in RemoveOldXlogFiles(), and then use
that value to report the error you are seeing.

> I know that at this point we're hosed. This replication was removed and
> everything cleared (after all, we had already accumulated 300G of WAL, and
> we can't extend the FS forever, so we needed to clean up the situation).
>
> But, the way I look at it, Pg12 shouldn't have removed these 4 WAL files,
> and the fact that it did means that either we did something wrong (what,
> though?), or there is some kind of bug in Pg.
>
> Does anyone have any idea how to investigate/fix that?
>

As suggested above, I feel we need to add some logging to see what is
going on, unless you have a reproducible test case.

--
With Regards,
Amit Kapila.
