DROP DATABASE doesn't force other backends to close FDs

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, joerg(at)NetBSD(dot)org
Subject: DROP DATABASE doesn't force other backends to close FDs
Date: 2018-10-03 22:37:25
Message-ID: 20181003223725.elcu3t44fpd4lm56@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Joerg reported on IRC that after a DROP DATABASE the space of the
dropped database wasn't available to the OS until he killed a session
that was active from before the drop. I was kinda incredulous at first,
the code looks sane... Thomas figured out significant parts of this.

dropdb() does a fairly careful dance to ensure processes close FDs for
about-to-be-removed files:
* Drop pages for this database that are in the shared buffer cache. This
* is important to ensure that no remaining backend tries to write out a
* dirty buffer to the dead database later...

* Tell the stats collector to forget it immediately, too.

* Tell checkpointer to forget any pending fsync and unlink requests for
* files in the database; else the fsyncs will fail at next checkpoint, or
* worse, it will delete files that belong to a newly created database
* with the same OID.

* Force a checkpoint to make sure the checkpointer has received the
* message sent by ForgetDatabaseFsyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.

* Remove all tablespace subdirs belonging to the database.

By my code reading that ensures that FDs held open by checkpointer and
bgwriter are closed in a relatively timely fashion (although I don't
think there's a nice time-bound guarantee for bgwriter, which could be
problematic for windows).

That unfortunately disregards that normal backends could have plenty
files open for the to-be-dropped database, if there's any sort of
shared_buffer pressure. Whenever a backend writes out a dirty victim
page belonging to another database, it'll also open files therein.

I thought that'd not be a problem because surely we do a smgrclose() at
some proper point to avoid that issue. But we don't.

I don't quite know how to trivially [1] guarantee that an invalidation
we send out is processed by all backends before continuing. But even if
we just send out an invalidation to all backends that instruct them to
do an smgrcloseall() the next time they do invalidation processing, we'd
be much better off than what are now (where we'll never close those
files until there's a lot of fd pressure or the backend exits).

I wonder if this is responsible for some issues we had with windows in
the past...

[1] I think we need something like an acknowledge protocol for a few
things, most recently onlining enabling checksums anyway. What I
had in mind is simple:
1) one global 64bit request counter in shared memory
2) per-backend counter in shared memory, that acknowledge up to
which point requests have been processed.
3) per-purpose procsignal that request e.g. that checksums get
enabled, that smgrcloseall has been processed, etc.
4) Function that sends out such a signal to all backends and then
waits till all such requests have been processed.

This obviously shouldn't be used for very frequent occurances, but
the use-cases I have in mind should never be that.


Andres Freund


Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-10-03 22:55:33 Re: Skylake-S warning
Previous Message Daniel Wood 2018-10-03 21:29:39 Skylake-S warning