Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Date: 2020-02-14 16:20:15
Message-ID: CA+TgmoauGmGkqJLuE-VZGBeGSo0UcSuJ-CjBjV-1UCYtJRehcA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 14, 2020 at 10:43 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I remain unconvinced ... wouldn't both of those claims apply to any disk
> I/O request? Are we going to try to ensure that no I/O ever happens
> while holding an LWLock, and if so how? (Again, CheckpointLock is a
> counterexample, which has been that way for decades without reported
> problems. But actually I think buffer I/O locks are an even more
> direct counterexample.)

Yes, that's a problem. I proposed a patch a few years ago that
replaced the buffer I/O locks with condition variables, and I think
that's a good idea for lots of reasons, including this one. I never
quite got around to pushing that through to commit, but I think we
should do that. Aside from fixing this problem, it also prevents
certain scenarios where we can currently busy-loop.

I do realize that we're unlikely to ever solve this problem
completely, but I don't think that should discourage us from making
incremental progress. Just as debuggability is a sticking point for
you, what I'm going to call operate-ability is a sticking point for
me. My work here at EnterpriseDB exposes me on a fairly regular basis
to real broken systems, and I'm therefore really sensitive to the
concerns that people have when trying to recover after a system has
become, for one reason or another, really broken.

Interruptibility may not be the #1 concern in that area, but it's very
high on the list. EnterpriseDB customers, as a rule, *really* hate
being told to restart the database because one session is stuck. It
causes a lot of disruption for them and the person who does the
restart gets yelled at by their boss, and maybe their boss's boss and
the boss above that. It means that their whole application, which may
be mission-critical, is down until the database finishes restarting,
and that is not always a quick process, especially after an immediate
shutdown. I don't think we can ever make everything that can get stuck
interruptible, but the more we can do the better.

The work you and others have done over the years to add
CHECK_FOR_INTERRUPTS() to more places pays real dividends. Making
sessions that are blocked on disk I/O interruptible in at least some
of the more common cases would be a huge win. Other people may well
have different experiences, but my experience is that the disk
deciding to conk out for a while or just respond very, very slowly is a
very common problem even (and sometimes especially) on very expensive
hardware. Obviously that's not great and you're in lots of trouble,
but being able to hit ^C and get control back significantly improves
your chances of being able to understand what has happened and recover
from it.

> > That's an interesting idea, but it doesn't make the lock acquisition
> > itself interruptible, which seems pretty important to me in this case.
>
> Good point: if you think the contained operation might run too long to
> suit you, then you don't want other backends to be stuck behind it for
> the same amount of time.

Right.

> > I wonder if we could have an LWLockAcquireInterruptibly() or some such
> > that allows the lock acquisition itself to be interruptible. I think
> > that would require some rejiggering but it might be doable.
>
> Yeah, I had the impression from a brief look at LWLockAcquire that
> it was itself depending on not throwing errors partway through.
> But with careful and perhaps-a-shade-slower coding, we could probably
> make a version that didn't require that.

Yeah, that was my thought, too, but I didn't study it that carefully,
so somebody would need to do that.
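
To make the shape of the idea concrete, here is a toy model in Python,
not PostgreSQL code. ToyLock, acquire_interruptibly(), the Interrupted
exception, and the polling interval are all hypothetical names and
choices made for this sketch; a real backend version would presumably
wait on its latch rather than poll. The only point it illustrates is
that the waiter has to undo its wait-queue bookkeeping on every exit
path before the error propagates, which is the "not throwing errors
partway through" concern mentioned above.

```python
import threading


class Interrupted(Exception):
    """Raised when a waiter is cancelled before it gets the lock."""


class ToyLock:
    """Exclusive lock whose acquisition can be abandoned while waiting."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False
        self.waiters = 0  # bookkeeping that must stay consistent on error

    def acquire_interruptibly(self, interrupt, poll=0.01):
        """Take the lock, or raise Interrupted if `interrupt` is set while waiting."""
        with self._cond:
            self.waiters += 1
            try:
                while self._held:
                    if interrupt.is_set():
                        raise Interrupted()
                    # Wake up periodically to recheck the interrupt flag.
                    self._cond.wait(timeout=poll)
                self._held = True
            finally:
                # Leave the wait queue on every exit path, success or error.
                self.waiters -= 1

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify()


def demo():
    """One holder that never releases, one waiter that gets cancelled."""
    lock = ToyLock()
    lock.acquire_interruptibly(threading.Event())  # holder "stuck in I/O"

    cancel = threading.Event()
    outcome = []

    def waiter():
        try:
            lock.acquire_interruptibly(cancel)
            outcome.append("acquired")
        except Interrupted:
            outcome.append("interrupted")

    t = threading.Thread(target=waiter)
    t.start()
    cancel.set()  # the moral equivalent of hitting ^C in the stuck session
    t.join()
    return outcome[0], lock.waiters
```

In this model the cancelled waiter gets control back even though the
holder never releases, and the waiter count returns to zero afterwards.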

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
