Re: subscriptionCheck failures on nightjar

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: subscriptionCheck failures on nightjar
Date: 2019-02-13 16:57:32
Message-ID: 27965.1550077052@sss.pgh.pa.us
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> On Mon, Feb 11, 2019 at 7:31 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file "pg_logical/snapshots/0-160B578.snap": No such file or directory

> <pokes at totally unfamiliar code>

> They get atomically renamed into place, which seems kosher even if
> snapshots for the same LSN are created concurrently by different
> backends (and tracing syscalls confirms that that does occasionally
> happen). It's hard to believe that nightjar's rename() ceased to be
> atomic a couple of months ago. It looks like the only way for files
> to get unlinked after that is by CheckPointSnapBuild() deciding they
> are too old.

> Hmm. Could this be relevant, and cause a well timed checkpoint to
> unlink files too soon?
> 2019-02-12 21:52:58.304 EST [22922] WARNING: out of logical
> replication worker slots

I've managed to reproduce this locally, and obtained this PANIC:

log/010_truncate_publisher.log:2019-02-13 11:29:04.759 EST [9973] sub3 PANIC: could not open file "pg_logical/snapshots/0-16067B0.snap": No such file or directory

with this stack trace:

#0 0x0000000801ab610c in kill () from /lib/libc.so.7
#1 0x0000000801ab4d3b in abort () from /lib/libc.so.7
#2 0x000000000089202e in errfinish (dummy=Variable "dummy" is not available.
) at elog.c:552
#3 0x000000000075d339 in fsync_fname_ext (
fname=0x7fffffffba20 "pg_logical/snapshots/0-16067B0.snap", isdir=Variable "isdir" is not available.
)
at fd.c:3372
#4 0x0000000000730c75 in SnapBuildSerialize (builder=0x80243c118,
lsn=23095216) at snapbuild.c:1669
#5 0x0000000000731297 in SnapBuildProcessRunningXacts (builder=0x80243c118,
lsn=23095216, running=0x8024424f0) at snapbuild.c:1110
#6 0x0000000000722eac in LogicalDecodingProcessRecord (ctx=0x802414118,
record=0x8024143d8) at decode.c:318
#7 0x0000000000736f30 in XLogSendLogical () at walsender.c:2826
#8 0x00000000007389c7 in WalSndLoop (send_data=0x736ed0 <XLogSendLogical>)
at walsender.c:2184
#9 0x0000000000738c3d in StartLogicalReplication (cmd=Variable "cmd" is not available.
) at walsender.c:1118
#10 0x0000000000739895 in exec_replication_command (
cmd_string=0x80240e118 "START_REPLICATION SLOT \"sub3\" LOGICAL 0/0 (proto_version '1', publication_names '\"pub3\"')") at walsender.c:1536
#11 0x000000000078b272 in PostgresMain (argc=-14464, argv=Variable "argv" is not available.
) at postgres.c:4252
#12 0x00000000007113fa in PostmasterMain (argc=-14256, argv=0x7fffffffcc60)
at postmaster.c:4382

So the problem seems to boil down to "somebody removed the snapshot
file between the time we renamed it into place and the time we tried
to fsync it".
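
To make the race concrete, here's the interleaving I believe we're hitting
(paraphrasing the tail end of SnapBuildSerialize() and your reading of
CheckPointSnapBuild(); not a verbatim excerpt):

	/* walsender, in SnapBuildSerialize(): */
	rename(tmppath, path);		/* snapshot now visible under its final name */

	/* ... meanwhile the checkpointer runs CheckPointSnapBuild(), decides
	 * this serialized snapshot is older than anything it needs to keep,
	 * and does: */
	unlink(path);

	/* walsender, back in SnapBuildSerialize(): */
	fsync_fname(path, false);	/* open() fails with ENOENT -> PANIC */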

I do find messages like the one you mention, but they are on the
subscriber side, not the publisher side, so I'm not sure whether this
matches the scenario you have in mind?

log/010_truncate_subscriber.log:2019-02-13 11:29:02.343 EST [9970] WARNING: out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:02.344 EST [9970] WARNING: out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:03.401 EST [9970] WARNING: out of logical replication worker slots

Anyway, I think we might be able to fix this along the lines of

 	CloseTransientFile(fd);

+	/* ensure snapshot file is down to stable storage */
+	fsync_fname(tmppath, false);
 	fsync_fname("pg_logical/snapshots", true);

 	/*
 	 * We may overwrite the work from some other backend, but that's ok, our
 	 * snapshot is valid as well, we'll just have done some superfluous work.
 	 */
 	if (rename(tmppath, path) != 0)
 	{
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	}

-	/* make sure we persist */
-	fsync_fname(path, false);
+	/* ensure we persist the file rename */
 	fsync_fname("pg_logical/snapshots", true);

The existing code here seems simply wacky/unsafe to me regardless
of this race condition: couldn't it potentially result in a corrupt
snapshot file appearing with a valid name, if the system crashes
after persisting the rename but before it's pushed the data out?
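
To spell out the ordering I think we actually want, here's a sketch of the
tail of SnapBuildSerialize() with the above applied, and with the pre-rename
directory sync dropped (see the question below about whether it's needed):

	CloseTransientFile(fd);

	/* 1. make the contents durable while still under the temporary name */
	fsync_fname(tmppath, false);

	/* 2. atomically expose the snapshot under its final name */
	if (rename(tmppath, path) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not rename file \"%s\" to \"%s\": %m",
						tmppath, path)));

	/* 3. make the rename itself durable */
	fsync_fname("pg_logical/snapshots", true);

With that ordering, a crash can leave behind either the temp file or a
fully-synced snapshot under the final name, but never a valid-looking
filename pointing at data that was never flushed.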

I also wonder why bother with the directory sync just before the
rename.

regards, tom lane
