Re: Trying to understand a failed upgrade in AWS RDS

From: Mike Lissner <mlissner(at)michaeljaylissner(dot)com>
To: "Elterman, Michael" <melterman(at)enova(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Trying to understand a failed upgrade in AWS RDS
Date: 2023-05-19 21:56:34
Message-ID: CAMp9=EzwL1cYBWqFkT5SEJvJg6uRfUpWwpWgMUcmUcK9heSHhQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks for the suggestions. I think in the future I'll do something like
this rather than try to re-use existing subscriptions.

I'm still trying to understand what went wrong though. Putting a finer
point on my question: Does pg_upgrade mess up disabled subscriptions?

On Fri, May 19, 2023 at 1:55 PM Elterman, Michael <melterman(at)enova(dot)com>
wrote:

> Please, use the following runbook.
> 1. Disable the subscription to pg10.
> 2. Disable Application Users on Publisher.
> 3. Drop all replication slots on Publisher (The upgrade can not be
> executed if there are any replication slots)
> 4. Run RDS's upgrade (which runs pg_upgrade).
> 5. Recreate replication slots with the same names.
> 6. Enable Application Users on Publisher.
> 7. Re-Enable the subscriptions to the newly upgraded server.
> Good luck
>
> On Fri, May 19, 2023 at 11:49 AM Mike Lissner <
> mlissner(at)michaeljaylissner(dot)com> wrote:
>
>> I also am realizing belatedly that my solution of dropping the subscriber
>> probably won't work anyway, since I'd lose the changes on the publisher for
>> the duration of the upgrade. Maybe I could drop the subscription while
>> keeping the slot on the publisher, and then create a new subscription after
>> the upgrade using that slot and copy_data=False? Getting wonky.
>>
>> On Fri, May 19, 2023 at 8:17 AM Mike Lissner <
>> mlissner(at)michaeljaylissner(dot)com> wrote:
>>
>>> Hi all,
>>>
>>> In AWS RDS, we are using logical replication between a postgresql 14
>>> publisher and a postgresql 10 subscriber. The subscriber is rather old, so
>>> yesterday I tried to update it using AWS's built in upgrade tool (which
>>> uses pg_upgrade behind the scenes).
>>>
>>> I did a pretty thorough test run before beginning, but the live run went
>>> pretty poorly. My process was:
>>>
>>> 1. Disable the subscription to pg10.
>>> 2. Run RDS's upgrade (which runs pg_upgrade).
>>> 3. Re-Enable the subscription to the newly upgraded server.
>>>
>>> The idea was that the publisher could still be live and collect changes,
>>> and then on step 3, those changes would flush to the newly upgraded server.
>>>
>>> When I hit step three, things went awry. From what I can tell, it seems
>>> like pg_upgrade might have wiped out the LSN location of the subscriber,
>>> because I was getting many messages in the logs saying:
>>>
>>> 2023-05-19 01:01:09 UTC:100.20.224.120(56536):django(at)courtlistener:[29669]:STATEMENT: CREATE_REPLICATION_SLOT "pg_18278_sync_86449755_7234675743763347169" LOGICAL pgoutput USE_SNAPSHOT2023-05-19 01:01:09 UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:ERROR: replication slot "pg_18278_sync_16561_7234675743763347169" does not exist2023-05-19 01:01:09 UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:STATEMENT: DROP_REPLICATION_SLOT pg_18278_sync_16561_7234675743763347169 WAIT2023-05-19 01:01:09 UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:ERROR: all replication slots are in use2023-05-19 01:01:09 UTC:100.20.224.120(56550):django(at)courtlistener:[29670]:HINT: Free one or increase max_replication_slots.
>>>
>>> I followed those instructions, and upped max_replication_slots to 200.
>>> That fixed that error, but then I had errors about COPY commands failing,
>>> and looking in the publisher I saw about 150 slots like:
>>>
>>> select * from pg_replication_slots ;
>>> slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase --------------------------------------------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------+-----------
>>> pg_18278_sync_86449408_7234675743763347169 | pgoutput | logical | 16428 | courtlistener | f | t | 6906 | | 859962500 | EA5/954A9F18 | | reserved | | f
>>> pg_18278_sync_20492279_7234675743763347169 | pgoutput | logical | 16428 | courtlistener | f | f | | | 859962448 | EA5/9548EDF0 | EA5/9548EE28 | reserved | | f
>>> pg_18278_sync_16940_7234675743763347169 | pgoutput | logical | 16428 | courtlistener | f | f | | | 859962448 | EA5/9548EE60 | EA5/9548EE98 | reserved | | f
>>>
>>>
>>> So this looks like it's trying to sync all of the existing tables all
>>> over again when I re-enabled the subscription.
>>>
>>> Does that make sense? In the future, I'll DROP the subscription and then
>>> create a new one with copy_data=False, but this was a real gotcha.
>>>
>>> Anybody know what's going on here?
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tiffany Thang 2023-05-20 04:36:27 Profiling a function call
Previous Message Ziga 2023-05-19 21:31:00 Re: Adding SHOW CREATE TABLE