Re: More race conditions in logical replication

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: More race conditions in logical replication
Date: 2017-08-08 00:14:33
Message-ID: 20170808001433.rozlseaf4m2wkw3n@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Alvaro Herrera wrote:

> > I just noticed that Jacana failed again in the subscription test, and it
> > looks like it's related to this. I'll take a look tomorrow if no one
> > beats me to it.
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-07-26%2014%3A39%3A54
>
> Ahh, I misread the message. This one is actually about the replication
> *origin* being still active, not the replication *slot*. I suppose
> origins need the same treatment as we just did for slots. Anybody wants
> to volunteer a patch?

So here's a patch. I was able to reproduce the issue locally by adding
a couple of sleeps, just like Tom did in the case of replication slots.
With this patch it seems the problem is solved. The mechanism is the
same as Petr used to fix replication origins -- if an origin is busy,
sleep on the CV until someone else signals us; and each time the
acquirer PID changes, broadcast a signal. Like before, this is likely
to be a problem in older branches too, but we have no CVs to backpatch
this.

BTW, I noticed that the PG_WAIT_LOCK value that we're using for wait
event here (and in the replication slot case) is bogus. We probably
need something new here.

I found four instances of this problem in buildfarm,
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-07-26%2014%3A39%3A54
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-07-28%2021%3A00%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-07-31%2007%3A03%3A20
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-08-04%2004%3A34%3A04

Jacana only stopped having it because it broke for unrelated reasons. I
bet we'll see another failure soon ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
drop-origins.patch text/plain 5.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-08-08 00:36:51 Re: More race conditions in logical replication
Previous Message Noah Misch 2017-08-08 00:10:14 Re: possible encoding issues with libxml2 functions