Re: src/test/recovery regression failure on bionic

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: src/test/recovery regression failure on bionic
Date: 2020-01-09 00:18:04
Message-ID: 2581.1578529084@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> Is it worth having the test close superfluous FDs? It'd not be hard to do
> so via brute force (or even going through /proc/self/fd).

No, it isn't, because d20703805's test is broken by design. There
are any number of reasons why there might be more than three-or-so
FDs open during postmaster start. Here are a few:

* It seems pretty likely that at least one of those FDs is
intentionally being left open by cron so it can detect death of
all child processes (like our postmaster death pipe). Forcibly
closing them will not necessarily have nice results. Other
execution environments might do similar tricks.

* On platforms where semaphores eat an FD apiece, we intentionally
open those before counting free FDs.

* We run process_shared_preload_libraries before counting free FDs,
too. If a loaded library intentionally leaves a FD open in the
postmaster, counting that against the limit also seems like a good
idea.
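
For concreteness, the brute-force loop from the quoted suggestion
might look roughly like this on Linux (a sketch only;
close_superfluous_fds is a name I just made up). The list above is
exactly why we could not run anything like it blindly in the
postmaster:

#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: close every inherited FD above stderr by walking
 * /proc/self/fd (Linux-only).  Note this would also close cron's
 * child-death FD, semaphore FDs, and anything a preloaded library
 * left open on purpose. */
static void
close_superfluous_fds(void)
{
	DIR		   *dir = opendir("/proc/self/fd");
	struct dirent *de;

	if (dir == NULL)
		return;					/* no /proc, or not Linux */

	while ((de = readdir(dir)) != NULL)
	{
		int			fd = atoi(de->d_name);

		/* atoi() yields 0 for "." and "..", so fd > 2 skips those
		 * along with stdin/stdout/stderr. */
		if (fd > 2 && fd != dirfd(dir))
			close(fd);
	}
	closedir(dir);
}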

My opinion is still that we should just get rid of that test case.
The odds of it ever finding anything interesting seem too low to
justify the work that will be involved in (a) getting it to work
reliably today and then (b) keeping it working. Running on the
hairy edge of not having enough FDs doesn't seem like a use-case
that we want to spend a lot of time on, but we will be if that
test stays as it is. Example: if max_safe_fds is only 10, as
this test is trying to make it be, then maxAllocatedDescs won't
be allowed to exceed 5. Do you want to bet that no code paths
will reasonably exceed that limit? [1] What will we do about it
when we find one that does?

I also note that the test can only catch cases where we used
OpenTransientFile() in an inappropriate way. I think it's at
least as likely that somebody would carelessly use open()
directly, and then we have no chance of catching it till the
kernel complains about EMFILE.
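
To make the contrast concrete (read_my_file and its error messages
are invented for illustration, not code from the tree): the tracked
path goes through fd.c's accounting, the careless one doesn't.

#include "postgres.h"

#include <fcntl.h>

#include "storage/fd.h"

/* Hypothetical example function. */
static void
read_my_file(const char *path)
{
	/* Tracked: counted against max_safe_fds, closed at abort. */
	int			fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);

	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));

	/* ... read the data ... */

	if (CloseTransientFile(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close file \"%s\": %m", path)));

	/* By contrast, a bare open(path, O_RDONLY) here would bypass
	 * fd.c entirely, and nothing notices until EMFILE. */
}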

Thinking about that some more, maybe the appropriate thing
to do is not to mess with max_files_per_process as such,
but to test with some artificial limit on maxAllocatedDescs.
We still won't catch direct use of open(), but that could test
for misuse of OpenTransientFile() with a lot less environment
dependence.
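
Concretely, that could be as small as a developer-only knob
consulted in fd.c; something like the following sketch
(debug_max_allocated_descs is a spelling I'm inventing here, not a
proposal for the actual name):

/* Hypothetical developer GUC: clamp the number of descriptors fd.c
 * will hand out, independently of max_files_per_process. */
int			debug_max_allocated_descs = -1; /* -1 means no clamp */

/* Then, near the top of reserveAllocatedDesc(): */
if (debug_max_allocated_descs >= 0 &&
	numAllocatedDescs >= debug_max_allocated_descs)
	return false;			/* callers already report the failure */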

regards, tom lane

[1] Although I notice that the code coverage report shows we
never reach the enlargement step in reserveAllocatedDesc(),
which means that the set of tests we run today don't exercise
any such path. I'm somewhat surprised to see that we *do*
seem to exercise overrunning max_safe_fds, though, since
ReleaseLruFile() is reached. Maybe this test case is
responsible for that?
