Re: Re: Buffer access rules, and a probable bug

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM>
Cc: Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Buffer access rules, and a probable bug
Date: 2001-07-03 22:27:17
Message-ID: 26093.994199237@sss.pgh.pa.us
Lists: pgsql-hackers

Okay, on to the next concern. I've been thinking some more about the
restrictions needed to make the world safe for concurrent VACUUM.
I previously said:

> 5. To physically remove a tuple or compact free space on a page, one
> must hold a pin and an exclusive lock, *and* observe while holding the
> exclusive lock that the buffer's shared reference count is one (ie,
> no other backend holds a pin). If these conditions are met then no other
> backend can perform a page scan until the exclusive lock is dropped, and
> no other backend can be holding a reference to an existing tuple that it
> might expect to examine again. Note that another backend might pin the
> buffer (increment the refcount) while one is performing the cleanup, but
> it won't be able to actually examine the page until it acquires shared
> or exclusive lock.
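
Restated as code, rule 5 says: with the exclusive lock held, a pin count
of exactly one (our own) is what licenses physical cleanup.  A minimal
toy sketch follows; the struct and names are invented for illustration
and do not match the real buffer-descriptor layout:

    #include <stdbool.h>

    typedef struct
    {
        int  refcount;        /* shared pin count, including our own pin */
        bool excl_lock_held;  /* we hold the buffer's lock exclusively */
    } ToyBufferDesc;

    static bool
    cleanup_allowed(const ToyBufferDesc *buf)
    {
        /*
         * Another backend may still pin the page while we work, but it
         * cannot actually examine it until it gets the shared or
         * exclusive lock, which we are holding.
         */
        return buf->excl_lock_held && buf->refcount == 1;
    }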

This is OK when considering a page in isolation, but it does not get the
job done when one is deleting related index tuples and heap tuples. It
seems to me that there *must* be some cross-page coupling to make that
work safely. Otherwise you could have this scenario:

1. Indexscanning process visits an index tuple, decides to access the
corresponding heap tuple, drops its lock on the index buffer page.

2. VACUUMing process visits the index buffer page and marks the index
tuple dead. (It won't try to delete the tuple yet, since it sees
the pin still held on the page by process #1, but it can acquire
exclusive lock and mark the tuple dead anyway.)

3. VACUUMing process is the first to acquire pin and lock on the heap
buffer page. It sees no other pin, so it deletes the tuple.

4. Indexscanning process finally acquires pin and lock on the heap page,
tries to access what is now a gone tuple. Ooops. (Even if we
made that not an error condition, it could be worse: what if a
third process already reused the line pointer for a new tuple?)
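
The same interleaving, compressed into a single-threaded toy program
(purely illustrative: pin counts are bare ints here, and none of this
is real backend code):

    #include <stdio.h>

    typedef struct { int refcount; } ToyBuffer;

    int main(void)
    {
        ToyBuffer indexpage = { 0 };
        ToyBuffer heappage  = { 0 };

        /* 1. Scanner pins and locks the index page, copies the heap TID,
         *    then drops the lock but keeps its pin. */
        indexpage.refcount++;

        /* 2. VACUUM locks the index page exclusively; the scanner's pin
         *    is still there, so it only marks the index tuple dead. */
        if (indexpage.refcount > 0)
            printf("VACUUM marks the index tuple dead, cannot compact\n");

        /* 3. VACUUM reaches the heap page first.  Nobody else has it
         *    pinned, so under rule 5 it may delete the tuple. */
        heappage.refcount++;
        if (heappage.refcount == 1)
            printf("VACUUM removes the heap tuple\n");
        heappage.refcount--;

        /* 4. Only now does the scanner pin the heap page, and it goes
         *    looking for a line pointer that is gone (or reused). */
        heappage.refcount++;
        printf("scanner follows a dangling TID\n");
        return 0;
    }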

It does not help to postpone the actual cleaning of an index or heap
page until its own pin count drops to zero --- the problem here is that
an indexscanner has acquired a reference into a heap page from the
index, but does not yet hold a pin on the heap page to ensure that
the reference stays good. So we can't just postpone the cleaning of
the index page till it has pin count zero, we have to make the related
heap page(s)' cleanup wait for that to happen too.

I can think of two ways of guaranteeing that this problem cannot happen.

One is for an indexscanning process to retain its shared lock on the
index page until it has acquired at least a pin on the heap page.
This is very bad for concurrency --- it means that we'd typically
be holding indexpage shared locks for the time needed to read in a
randomly-accessed disk page. And it's very complicated, since we still
need all the other rules, plus the mechanism for postponing cleanups
until pin count goes to zero. It'd cause considerable changes to the
index access method API, too.
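
In sketch form, the lock-coupled ordering would have to look roughly
like the toy below (every name is a stub invented for illustration,
not the real indexam or bufmgr code); the cost is that the index page
stays share-locked across a possibly random heap-page read:

    typedef struct { int refcount; int lockers; } ToyPage;

    static void share_lock(ToyPage *p)  { p->lockers++; }
    static void unlock_page(ToyPage *p) { p->lockers--; }
    static void pin_page(ToyPage *p)    { p->refcount++; }

    static void
    index_fetch_heap(ToyPage *indexpage, ToyPage *heappage)
    {
        share_lock(indexpage);
        /* ... read the index tuple, copy its heap TID ... */
        pin_page(heappage);      /* may mean a disk read under the lock */
        unlock_page(indexpage);  /* only now give up the index page */
        /* ... then lock and examine the heap page as usual ... */
    }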

The other is to forget about asynchronous cleaning, and instead have the
VACUUM process directly do the wait for pin count zero, then clean the
index page. Then when it does the same for the heap page, we know for
sure there are no indexscanners in transit to the heap page. This would
be logically a lot simpler, it seems to me. Another advantage is that
we need only one WAL entry per cleaned page, not two (one for the
initial mark-dead step and one for the postponable compaction step),
and there's no need for an intermediate "gone but not forgotten" state
for index tuples.

We could implement this in pretty nearly the same way as the "mark for
cleanup" facility that you partially implemented awhile back:
essentially, the cleanup callback would send a signal or semaphore
increment to the waiting process, which would then try to acquire pin
and exclusive lock on the buffer. If it succeeded in observing pin
count 1 with exclusive lock, it could proceed with cleanup, else loop
back and try again. Eventually it'll get the lock. (It might take
awhile, but for a background VACUUM I think that's OK.)
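
Roughly, the waiting side might look like this toy sketch, with pthread
primitives standing in for the buffer's lock and for the wakeup the
unpinning backend would send (none of these names are real backend
code, and initialization is omitted):

    #include <pthread.h>

    typedef struct
    {
        pthread_mutex_t lock;      /* stands in for the buffer's lock */
        pthread_cond_t  unpinned;  /* wakeup sent by the unpin path */
        int             refcount;  /* pin count, including our own pin */
    } ToyBuffer;

    /* VACUUM side: we already hold our own pin, counted in refcount. */
    static void
    wait_and_clean(ToyBuffer *buf, void (*clean_page)(ToyBuffer *))
    {
        pthread_mutex_lock(&buf->lock);    /* the "exclusive lock" */
        while (buf->refcount > 1)          /* someone else still pinned */
            pthread_cond_wait(&buf->unpinned, &buf->lock);
        /* Pin count 1 observed while holding the exclusive lock: safe. */
        clean_page(buf);
        pthread_mutex_unlock(&buf->lock);
    }

    /* Every other backend's unpin path sends the matching wakeup. */
    static void
    unpin(ToyBuffer *buf)
    {
        pthread_mutex_lock(&buf->lock);
        buf->refcount--;
        pthread_cond_signal(&buf->unpinned);
        pthread_mutex_unlock(&buf->lock);
    }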

What I'm wondering is if you had any other intended use for "mark for
cleanup" than VACUUM. The cheapest implementation would allow only
one process to be waiting for cleanup on a given buffer, which is OK
for VACUUM because we'll only allow one VACUUM at a time on a relation
anyway. But if you had some other uses in mind, maybe the code needs
to support multiple waiters.

Comments?

regards, tom lane
