Re: ALTER TABLE lock strength reduction patch is unsafe

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ALTER TABLE lock strength reduction patch is unsafe
Date: 2011-06-24 19:46:17
Message-ID: BANLkTik8KAbV3Ka5jgjHy5hpdFOF+fWmDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 23, 2011 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jun 22, 2011 at 11:26 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> For people that still think there is still risk, I've added a
>> parameter called "relaxed_ddl_locking" which defaults to "off". That
>> way people can use this feature in an informed way without exposing us
>> to support risks from its use.
>
> I can't get excited about adding a GUC that says "you can turn this
> off if you want but don't blame us if it breaks".  That just doesn't
> seem like good software engineering to me.

Nobody is suggesting we fix the pre-existing bug - we're just going to
leave it. That doesn't sound like good software engineering either,
but I'm not going to complain because I understand.

We're in a bind here and I'm trying to suggest practical ways out, as
well as cater for the levels of risk people are willing to accept. I
don't think we need that parameter at all, but if you do then I'm
willing to tolerate it.

>> I think we should be clear that this adds *one* lock/unlock for each
>> object for each session, ONCE only after each transaction that
>> executed a DDL statement against an object. So with 200 sessions, a
>> typical ALTER TABLE statement will cause 400 locks. The current lock
>> manager maxes out above 50,000 locks per second, so any comparative
>> analysis will show this is a minor blip on even a heavy workload.
>
> I think that using locks to address this problem is the wrong
> approach.  I think the right fix is to use something other than
> SnapshotNow to do the system catalog scans.

I agree with that, but its a much bigger job than you think and I
really doubt that you will do it for 9.2, definitely not for 9.1.

There are so many call points, I would not bother personally to
attempt a such an critical and widely invasive fix.

So I'd put that down as a Blue Sky will-fix-one-day thing.

> However, if we were going
> to use locking, then we'd need to ensure that:
>
> (1) No transaction which has made changes to a system catalog may
> commit while some other transaction is in the middle of scanning that
> catalog.
> (2) No transaction which has made changes to a set of system catalogs
> may commit while some other transaction is in the midst of fetching
> interrelated data from a subset of those catalogs.
>
> It's important to note, I think, that the problem here doesn't occur
> at the time that the table rows are updated, because those rows aren't
> visible yet.  The problem happens at commit time.  So unless I'm
> missing something, ALTER TABLE would only need to acquire the relation
> definition lock after it has finished all of its other work.  In fact,
> it doesn't even really need to acquire it then, either.  It could just
> queue up a list of relation definition locks to acquire immediately
> before commit, which would be advantageous if the transaction went on
> to abort, since in that case we'd skip the work of acquiring the locks
> at all.

That would work if the only thing ALTER TABLE does is write. There are
various places where we read catalogs and all of those catalogs need
to be relationally integrated. So the only safe way, without lots of
incredibly detailed analysis (which you would then find fault with) is
to lock at the top and hold the lock throughout most of the
processing. That's the safe thing to do, at least.

I do already use the point you make in the patch, where it's used to
unlock before and then relock after the FK constraint check, so that
the RelationDefinition lock is never held for long periods in any code
path.

> In fact, without doing that, this patch would undo much of the point
> of the original ALTER TABLE lock strength reduction:
>
> begin;
> alter table foo alter column a set storage plain;
> <start a new psql session in another window>
> select * from foo;
>
> In master, the SELECT completes without blocking.  With this patch
> applied, the SELECT blocks, just as it did in 9.0, because it can't
> rebuild the relcache entry without getting the relation definition
> lock, which the alter table will hold until commit.

It's not the same at all. Yes, they are both locks; what counts is the
frequency and duration of locking.

With AccessExclusiveLock we prevent all SELECTs from accessing the
table while we queue for the lock, which can be blocked behind a long
running query. So the total outage to the application for one minor
change can be hours - unpredictably. (Which is why I never wrote the
ALTER TABLE set ndistinct myself, even though I suggested it, cos its
mostly unusable).

With the reduced locking level AND this patch the *rebuild* can be
blocked behind the DDL. But the DDL itself is not blocked in the same
way, so the total outage is much reduced and the whole set of actions
goes through fairly quickly.

If you only submit one DDL operation then the rebuild has nothing to
block against, so the whole thing goes through smoothly.

> In fact, the same
> thing happens even if foo has been accessed previously in the same
> session.  If there is already an open *transaction* that has accessed
> foo, then, it seems, it can continue to do so until it commits.  But
> as soon as it tries to start a new transaction, it blocks again.  I
> don't quite understand why that is - I didn't think we invalidated the
> entire relcache on every commit.

We don't invalidate the relcache on every commit.

> But that's what happens.

Test case please. I don't understand the problem you're describing.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-06-24 19:54:10 Re: FWD: fastlock+lazyvzid patch performance
Previous Message Robert Haas 2011-06-24 19:32:25 Re: heap_hot_search_buffer refactoring