Re: More race conditions in logical replication

From: Noah Misch <noah(at)leadboat(dot)com>
To: peter(dot)eisentraut(at)2ndquadrant(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: More race conditions in logical replication
Date: 2017-07-04 04:42:35
Message-ID: 20170704044235.GB2113376@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jul 02, 2017 at 07:54:48PM -0400, Tom Lane wrote:
> I noticed a recent failure that looked suspiciously like a race condition:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
>
> The critical bit in the log file is
>
> error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
> DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
>
> After poking at it a bit, I found that I can cause several different
> failures of this ilk in the subscription tests by injecting delays at
> the points where a slot's active_pid is about to be cleared, as in the
> attached patch (which also adds some extra printouts for debugging
> purposes; none of that is meant for commit). It seems clear that there
> is inadequate interlocking going on when we kill and restart a logical
> rep worker: we're trying to start a new one before the old one has
> gotten out of the slot.
>
> I'm not particularly interested in fixing this myself, so I'm just
> going to add it to the open items list.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-07-04 05:25:55 Re: Request more documentation for incompatibility of parallelism and plpgsql exec_run_select
Previous Message Noah Misch 2017-07-04 04:34:46 Re: Race conditions with WAL sender PID lookups