Trying to understand a failed upgrade in AWS RDS

From: Mike Lissner <mlissner(at)michaeljaylissner(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Trying to understand a failed upgrade in AWS RDS
Date: 2023-05-19 15:17:10
Message-ID: CAMp9=Ezf9OyuQaQUZDHEauTp-Dn_pPdS6xNmcFJHuQM_qsmZvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

In AWS RDS, we are using logical replication between a postgresql 14
publisher and a postgresql 10 subscriber. The subscriber is rather old, so
yesterday I tried to update it using AWS's built in upgrade tool (which
uses pg_upgrade behind the scenes).

I did a pretty thorough test run before beginning, but the live run went
pretty poorly. My process was:

1. Disable the subscription to pg10.
2. Run RDS's upgrade (which runs pg_upgrade).
3. Re-Enable the subscription to the newly upgraded server.

The idea was that the publisher could still be live and collect changes,
and then on step 3, those changes would flush to the newly upgraded server.

When I hit step three, things went awry. From what I can tell, it seems
like pg_upgrade might have wiped out the LSN location of the subscriber,
because I was getting many messages in the logs saying:

2023-05-19 01:01:09
UTC:100.20.224.120(56536):django(at)courtlistener:[29669]:STATEMENT:
CREATE_REPLICATION_SLOT "pg_18278_sync_86449755_7234675743763347169"
LOGICAL pgoutput USE_SNAPSHOT2023-05-19 01:01:09
UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:ERROR:
replication slot "pg_18278_sync_16561_7234675743763347169" does not
exist2023-05-19 01:01:09
UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:STATEMENT:
DROP_REPLICATION_SLOT pg_18278_sync_16561_7234675743763347169
WAIT2023-05-19 01:01:09
UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:ERROR: all
replication slots are in use2023-05-19 01:01:09
UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:HINT: Free one
or increase max_replication_slots.

I followed those instructions, and upped max_replication_slots to 200. That
fixed that error, but then I had errors about COPY commands failing, and
looking in the publisher I saw about 150 slots like:

select * from pg_replication_slots ;
slot_name | plugin | slot_type |
datoid | database | temporary | active | active_pid | xmin |
catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status |
safe_wal_size | two_phase
--------------------------------------------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------+-----------
pg_18278_sync_86449408_7234675743763347169 | pgoutput | logical |
16428 | courtlistener | f | t | 6906 | |
859962500 | EA5/954A9F18 | | reserved |
| f
pg_18278_sync_20492279_7234675743763347169 | pgoutput | logical |
16428 | courtlistener | f | f | | |
859962448 | EA5/9548EDF0 | EA5/9548EE28 | reserved |
| f
pg_18278_sync_16940_7234675743763347169 | pgoutput | logical |
16428 | courtlistener | f | f | | |
859962448 | EA5/9548EE60 | EA5/9548EE98 | reserved |
| f

So this looks like it's trying to sync all of the existing tables all over
again when I re-enabled the subscription.

Does that make sense? In the future, I'll DROP the subscription and then
create a new one with copy_data=False, but this was a real gotcha.

Anybody know what's going on here?

Thanks,

Mike

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Mike Lissner 2023-05-19 16:49:13 Re: Trying to understand a failed upgrade in AWS RDS
Previous Message Victor Nordam Suadicani 2023-05-19 14:55:00 Re: Is there a good way to handle sum types (or tagged unions) in PostgreSQL?