Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?
Date: 2021-03-03 22:19:54
Message-ID: CA+hUKG+EFRSOKr_dFnCKwbCM=hF4+zHcdFO1Kh=DPvLJctu+7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 4, 2021 at 4:18 AM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
> > On 1 Mar 2021, at 12:54, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> Based on my (limited) experience with procsignalbarriers I think this patch is

Help wanted: must have at least 14 years experience with
ProcSignalBarrier! Yeah, I'm still figuring out the programming rules
here myself...

> correct; the general rule-of-thumb of synchronizing backend state on barrier
> absorption doesn't really apply in this case, literally all we want is to know
> that we've hit one interrupt and performed removals.

I guess the way to think about it is that the desired state is "you
have no files open that have been unlinked".

> >> +#if defined(WIN32) || defined(USE_ASSERT_CHECKING)
> >>
> >> Is the USE_ASSERT_CHECKING clause to exercise the code a more frequent than
> >> just on Windows? That could warrant a quick word in the comment if so IMO to
> >> avoid confusion.
> >
> > Note added.
>
> Since there is no way to get make the first destroy_tablespace_directories call
> fail on purpose in testing, the assertion coverage may have limited use though?

There is: all you have to do is drop a table, and then drop the
tablespace that held it without a checkpoint in between. That
scenario is exercised by the "tablespace" regression test, and you can
reach it manually like this on a Unix system, with assertions enabled.
On a Windows box, I believe it should be reached even if there was a
checkpoint in between (or maybe you need to have a second session that
has accessed the table, not sure, no actual Windows here I just fling
stuff at CI). I've added an elog() message to show the handler
running in each process in my cluster, so you can see it (it's also
instructive to put a sleep in there):

My psql session:

postgres=# create tablespace ts location '/tmp/ts';
CREATE TABLESPACE
postgres=# create table t () tablespace ts;
CREATE TABLE
postgres=# drop table t;
DROP TABLE
postgres=# drop tablespace ts;

At this point the log shows:

2021-03-04 09:54:33.429 NZDT [239811] LOG: ProcessBarrierSmgrRelease()
2021-03-04 09:54:33.429 NZDT [239821] LOG: ProcessBarrierSmgrRelease()
2021-03-04 09:54:33.429 NZDT [239821] STATEMENT: drop tablespace ts;
2021-03-04 09:54:33.429 NZDT [239814] LOG: ProcessBarrierSmgrRelease()
2021-03-04 09:54:33.429 NZDT [239816] LOG: ProcessBarrierSmgrRelease()
2021-03-04 09:54:33.429 NZDT [239812] LOG: ProcessBarrierSmgrRelease()
2021-03-04 09:54:33.429 NZDT [239813] LOG: ProcessBarrierSmgrRelease()

Now back to my session:

DROP TABLESPACE
postgres=#

> I don't have a Windows env handy right now, but everything works as expected
> when testing this on Linux and macOS by inducing the codepath. Will try to do
> some testing in Windows as well.

Thanks!

One question on my mind is: since this wait is interruptible (if you
get sick of waiting for a slow-to-respond process you can hit ^C, or
statement_timeout can presumably do it for you), do we leave things in
a sane state on error (catalog changes rolled back, no damage done on
disk)? There is actually a nasty race there already ("If we crash
before committing..."), and we need to make sure we don't make that
window wider. One thing I am pretty sure of is that it's never OK to
wait for a ProcSignalBarrier when you're not interruptible; for one
thing, you won't process the request yourself (self deadlock) and for
another, it would be hypocritical of you to expect others to process
interrupts when you can't (interprocess deadlock); perhaps there
should be an assertion about that, but it's pretty obvious if you
screw that up: it hangs. That's why I release and reacquire that
LWLock. But does that break some logic?

Andres just pointed me at the following CI failure on the AIO branch,
which seems to be due to a variant of this problem involving DROP
DATABASE.

https://cirrus-ci.com/task/6730034573475840?command=windows_worker_buf#L7

Duh, of course, we need the same thing in that case, and also in its
redo routine.

And... the same problem must also exist for the closely related ALTER
DATABASE ... SET TABLESPACE. I guess these cases are pretty unlikely
to fail without the AIO branch's funky "io worker" processes that love
hoarding file descriptors, but I suppose it must be possible for the
bgwriter to have a relevant file descriptor open at the wrong time on
master today.

One thing I haven't tried to do yet is improve the "pipelining" by
issuing the request sooner, in the cases where we do this stuff
unconditionally.

Attachment Content-Type Size
v7-0001-Fix-DROP-DATABASE-TABLESPACE-on-Windows.patch text/x-patch 10.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-03-03 22:21:31 Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?
Previous Message Justin Pryzby 2021-03-03 22:10:50 Re: [HACKERS] GSoC 2017: Foreign Key Arrays