Migrating a Patroni cluster from Ubuntu 16.04 to Ubuntu 18.04

From: "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
To: pgsql-general(at)postgresql(dot)org
Subject: Migrating a Patroni cluster from Ubuntu 16.04 to Ubuntu 18.04
Date: 2019-05-30 10:38:47
Message-ID: 20190530103847.kpz7tuy22kaca4qw@hjp.at
Lists: pgsql-general

I upgraded a cluster from Ubuntu 16.04 to Ubuntu 18.04 this week, and
since this wasn't as smooth as I had hoped, I'm posting my experiences
here in the hope that they will help others.

First: No production systems were harmed in the making of this report.
We have a test cluster, and it really keeps adrenaline levels down. I
might have gotten a tad nervous if this had been in production.

Start scenario:

* 2 nodes (we'll call them A and B) running
  * Ubuntu 16.04
  * Patroni 1.4.3 (3rd-party package)
  * etcd 2.2.5 (from Ubuntu)
  * PostgreSQL 10.8 (from pgdg)

* 1 node (E) running
  * Ubuntu 16.04
  * etcd 2.2.5 (from Ubuntu)

So, A and B are the database nodes and E is just there to provide the
quorum for the etcd cluster. (And obviously, Patroni is configured to
use etcd).

Goal: Upgrade to Ubuntu 18.04, leave everything else the same (as far as
possible).

At the start of the upgrade, A was the master, so I started with B.

do-release-upgrade upgraded the node to Ubuntu 18.04 successfully and
without any worrying warnings.

But after the machine rebooted, etcd wouldn't start up again.
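
(On a systemd-based box the quickest way to see why etcd refuses to
start is something like

    systemctl status etcd
    journalctl -u etcd -b

which should tell you what it is unhappy about.)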

What went wrong? Ubuntu 16.04 came with etcd 2.2.5, while Ubuntu 18.04
includes etcd 3.2.17, and you can't upgrade directly from 2.2.x to
3.2.x. You have to upgrade to 2.3.x first, then to 3.0.x, and finally
to 3.2.x. And you have to complete each step for the whole cluster
before proceeding to the next.

Since there are no Ubuntu packages for etcd 2.3 and 3.0, I fetched the
binary releases for etcd-v2.3.8 and etcd-v3.0.17 from GitHub. The
executables are statically linked, so you can just copy them into
/usr/bin without worrying about dependencies.
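
Roughly like this (URLs and paths from memory; the release tarballs
used to live under github.com/coreos/etcd, adjust as needed):

    wget https://github.com/coreos/etcd/releases/download/v2.3.8/etcd-v2.3.8-linux-amd64.tar.gz
    tar xzf etcd-v2.3.8-linux-amd64.tar.gz
    systemctl stop etcd
    cp etcd-v2.3.8-linux-amd64/etcd etcd-v2.3.8-linux-amd64/etcdctl /usr/bin/
    systemctl start etcd

(and the same again with v3.0.17 for the next step).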

So I restarted B with etcd 2.3 and it joined the cluster (of course it
wasn't quite so straightforward: while exploring different
possibilities I had corrupted /var/lib/etcd, so I had to remove the
node, clean out the data, and add the node again). Then the same for
the other two nodes (the docs say you should wait for 120 seconds
after restarting each node, and that really seems to be necessary; if
you are impatient you may just wind up doing it all again), and then
the whole cluster was on protocol version 2.3 (this can be checked
with "curl http://localhost:2379/version", and again, you really want
to check that before proceeding to the next step).
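
For completeness, the remove/clean/re-add dance and the version check
look roughly like this (the member ID and peer URL are placeholders):

    # on a healthy node: find the broken member, remove it, re-add it
    etcdctl member list
    etcdctl member remove <member-id>
    etcdctl member add B http://<peer-address-of-B>:2380
    # (member add prints the initial-cluster settings the re-added
    # node has to be started with)

    # on B: wipe the stale data directory before starting etcd again
    rm -rf /var/lib/etcd/*

    # after each upgrade step: check the version the cluster reports
    curl http://localhost:2379/version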

Then I did the same dance for etcd 3.0.

And finally I reinstalled etcd-server and etcd-client from the Ubuntu
repo. So now B was on 3.2 and the other two nodes were still on 3.0,
which is a compatible combination.
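
In apt terms that is simply

    apt install --reinstall etcd-server etcd-client

which puts the packaged 3.2.17 binaries back in place of the ones I
had copied in by hand.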

Next problem: Patroni. The 3rd-party Patroni package used Python 2.7,
and there was a problem in some Python library. Since Ubuntu now also
includes a patroni package (although for 1.4.2, a bit older than the
one we had), I didn't investigate further and just installed the
Ubuntu package. I had to rename the config file and install two
additional packages (python3-etcd and python3-etcd3gw) for etcd
support, but that was kind of obvious.
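
In package terms that boils down to something like this (the config
file paths are only an example; the exact names depend on what the old
package used and on where the Ubuntu package looks):

    apt install patroni python3-etcd python3-etcd3gw
    mv /etc/patroni.yml /etc/patroni/config.yml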

So now we had a working cluster again with one machine upgraded to
Ubuntu 18. Yay! \o/

The next node was E, which only runs etcd. Since it was already on
etcd 3.0, the upgrade to 3.2 was smooth.
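
A quick sanity check at this point (using the default v2 API):

    etcdctl cluster-health
    curl http://localhost:2379/version

should show three healthy members and the expected versions.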

Finally A: switch the master over to B, run do-release-upgrade, then
after the reboot reinstall patroni (+dependencies, +rename the
config). And ... everything works.
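
For the record, the switchover itself is just something like (syntax
and config path from memory, adjust to wherever your config ended up;
depending on the Patroni version the command may be "failover"
instead):

    patronictl -c /etc/patroni/config.yml switchover

which prompts for the current master, the candidate and an optional
scheduled time.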

hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp(at)hjp(dot)at  | management tools.
__/   | http://www.hjp.at/ |                -- Ross Anderson <https://www.edge.org/>
