Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction
Date: 2020-03-28 18:39:04
Message-ID: 20200328183904.GI20103@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 28, 2020 at 01:13:54PM -0400, Tom Lane wrote:
> The buildfarm just showed up another instability in the test cases
> we added:

Yea, as you said, this is an issue with the *testcase*. The function behavior
didn't change, we just weren't previously exercising it.

> select (w).size = :segsize as ok
> from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
> - ok
> -----
> - t
> -(1 row)
> -
> +ERROR: could not stat file "pg_wal/000000010000000000000078": No such file or directory
> select count(*) >= 0 as ok from pg_ls_archive_statusdir();
> ok
> ----
>
> It's pretty obvious what happened here: concurrent activity renamed or
> removed the WAL segment between when we saw it in the directory and
> when we tried to stat() it.
>
> This seems like it would be just as much of a hazard for field usage
> as it is for regression testing,

That's clearly true for pg_ls_waldir(), which call pg_ls_dir_files, and
includes some metadata columns.

> so I propose that we fix these directory-scanning functions to silently
> ignore ENOENT failures from stat(). Are there any for which we should not do
> that?

I think it's reasonable to ignore transient ENOENT for tmpdir, logdir, and
probably archive_statusdir. That doesn't currently affect pg_ls_dir(), which
lists file but not metadata for an arbitrary dir, so doesn't call stat().

Note that dangling links in the other functions currently cause (wrong [0])
error. I guess it should be documented if broken link is will be ignored due
to ENOENT.

Maybe we should lstat() the file to determine if it's a dangling link; if
lstat() fails, then skip it. Currently, we use stat(), which shows metdata of
a link's *target*. Maybe we'd change that.

Note that I have a patch which generalizes pg_ls_dir_files and makes
pg_ls_dir() a simple wrapper, so if that's pursued, they would behave the same
unless I add another flag to do otherwise (but behaving the same has its
merits). It already uses lstat() to show links to dirs as isdir=no, which was
needed to avoid recursing into links-to-dirs in the new helper function
pg_ls_dir_recurse(). https://commitfest.postgresql.org/26/2377/

[0] Which you fixed in 085b6b667 and I previously fixed at:
https://www.postgresql.org/message-id/attachment/106478/v2-0001-BUG-in-errmsg.patch
|$ sudo ln -s foo /var/log/postgresql/bar
|ts=# SELECT * FROM pg_ls_logdir() ORDER BY 3;
|ERROR: could not stat directory "/var/log/postgresql": No such file or directory

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-28 18:43:23 Re: pgsql: Add kqueue(2) support to the WaitEventSet API.
Previous Message Andres Freund 2020-03-28 18:34:34 Re: pgbench - refactor init functions with buffers