Re: Replication with Patroni not working after killing secondary and starting again

From: Zb B <zbig(dot)poland(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Replication with Patroni not working after killing secondary and starting again
Date: 2022-05-04 08:21:56
Message-ID: CAKwARkbqwVc35dZWFLvrwL_6FxvwJSq-UEzFareEcoLvqqNYsA@mail.gmail.com
Lists: pgsql-general

> What does `patronictl list` show during that interval?

Well, I can't reproduce the situation anymore. Replication now starts
immediately after Patroni is started on the secondary. I did run several
switchover commands in the meantime, though.

Meanwhile I did another test in which I ran a Java app that performs a large
number of *short* transactions (inserts), and while the app was running I
issued the Patroni switchover command:

patronictl -c /etc/patroni/patroni.yml switchover

It turned out the records were not replicated to the secondary and when I
tried to execute the switchover command on the primary I got the following
error:
Error: This cluster has no master

When I tried to execute the switchover command on the secondary, it worked.
However, because the primary and secondary had diverged, the records on the
old primary were rolled back (the number of records on the primary and the
secondary became the same, equal to what it had been on the old secondary).

Apparently there is something wrong with my cluster. How can I debug it? Do I
need to configure anything to make the replication synchronous?
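
One way to put a number on the discrepancy is to compare WAL positions: run
`SELECT pg_current_wal_lsn();` on the primary and
`SELECT pg_last_wal_replay_lsn();` on the secondary (PostgreSQL 10 or later),
then subtract one from the other. Below is a minimal Python sketch of that
subtraction. It is not something Patroni provides, the helper names are made
up for illustration, and the LSN 0/7000400 is a hypothetical value chosen next
to the 0/7000000 that appears in the primary's log quoted further down.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/7000000' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica is behind the primary."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)


if __name__ == "__main__":
    # Hypothetical values: the primary has written up to 0/7000400,
    # the secondary has replayed up to 0/7000000.
    print(replication_lag_bytes("0/7000400", "0/7000000"))  # prints 1024
```

A lag that keeps growing while the Java app is inserting would suggest the
secondary is not streaming at all. When both values are available in one
session, the built-in `pg_wal_lsn_diff()` does the same arithmetic in SQL.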

On Fri, 29 Apr 2022 at 22:33, Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:

> On 2022-04-28 11:09:12 +0200, Zb B wrote:
> > > When the secondary starts up it should continue replicating from where
> > > it stopped. However, it can only do this if the necessary information
> > > is still available. If WAL files have been deleted in the meantime, it
> > > can't replay them. There should be error messages in your logs on what
> > > went wrong.
> >
> > I did another test using different wal_sender_timeout parameter, as the
> > time of the secondary being shut down was longer than the default 60s
> > for this parameter.
>
> I don't think this will help. It will just make the primary slower in
> noticing that the secondary is gone.
>
>
> > I was hoping it would help but the result was the same (records were not
> > replicated to the secondary after the patroni start). Well, I just
> > verified again that the records were replicated after about 15 minutes
> > to the secondary, so probably the timeout setting helped, or I was not
> > patient enough before.
>
> The latter, I suspect. Although I'm surprised that it takes so long. In
> my experience, that takes only a few seconds, certainly less than a
> minute for replication to start (how long it takes to finish depends on
> the amount of data, of course).
>
> Patroni can nuke the secondary database and create a fresh copy
> (using basebackup). That might take 15 minutes (depending on the
> database size). I don't think it does that automatically, though. Also I
> think you would have noticed that.
>
> What does `patronictl list` show during that interval?
>
>
> > Is it normal to wait so long for the replication? (the original
> > transaction in primary took about 5 minutes and was about 3000 small
> > records). I am providing more details for completeness below:
> >
> > I get the following errors on the primary DB:
> > 2022-04-28 04:36:50.544 EDT [13794] WARNING: archive_mode enabled, yet archive_command is not set
> > 2022-04-28 04:37:34.893 EDT [14755] ERROR: replication slot "xyzd3riardb05" does not exist
> > 2022-04-28 04:37:34.893 EDT [14755] STATEMENT: START_REPLICATION SLOT "xyzd3riardb05" 0/7000000 TIMELINE 18
> ...
> > and after some time such errors stop appearing.
>
> So the replication slot is probably created after some time and then
> replication starts to work.
>
> I think that replication slot is managed by Patroni. So the question
> would be: Why does Patroni take so long to create it? Did it log
> anything?
>
> hp
>
> --
>    _  | Peter J. Holzer    | Story must make more sense than reality.
> |_|_) |                    |
> | |   | hjp(at)hjp(dot)at  |    -- Charles Stross, "Creative writing
> __/   | http://www.hjp.at/ |       challenge!"
>
