Re: Btree runtime recovery. Stuck spins.

From: "Vadim Mikheev" <vmikheev(at)sectorbase(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 20:48:46
Message-ID: 00cc01c092d9$b2f17120$4c79583f@sectorbase.com
Lists: pgsql-hackers

> > Hm. It was OK to use spinlocks to control buffer access when the max
> > delay was just the time to read or write one disk page. But it sounds

Actually, a btree split requires holding 3 buffer locks simultaneously,
and after that _bt_getstackbuf may read *many* parent buffers while
holding locks on 2 buffers. AFAIR, things are even worse in hash.

And anyway, there is always a chance that someone else will grab the
just-freed lock while you're sleeping for the next 0.01 sec. The problem
is that there is no priority/ordering among waiters for a spinlock.

> > like we've pushed the code way past what it was designed to do. I think
> > this needs some careful thought, not just a quick hack like increasing
> > the timeout interval.

I fear there is not enough time -:(

> After thinking more about this, simply increasing S_MAX_BUSY is clearly
> NOT a good answer. If you are under heavy load then processes that are
> spinning are making things worse, not better, because they are sucking
> CPU cycles that would be better spent on the processes that are holding
> the locks.
>
> It would not be very difficult to replace the per-disk-buffer spinlocks
> with regular lockmanager locks. Advantages:
> * Processes waiting for a buffer lock aren't sucking CPU cycles.
> * Deadlocks will be detected and handled reasonably. (The more stuff
> that WAL does while holding a buffer lock, the bigger the chances
> of deadlock. I think this is a significant concern now.)

I disagree. Lmgr needs deadlock detection code because deadlocks may be
caused by *user application* design and we must not count on
*user application* correctness. But we must not use deadlock detection
code when we are protected from deadlock by *our* design. Well, anyone can
make a mistake and break the order of lock acquisition - we should just fix
those bugs -:)
So, it doesn't matter *how much stuff WAL does while holding buffer
locks* as long as WAL itself doesn't acquire buffer locks.
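
Editor's illustration of "protected from deadlock by design": if every
code path takes its locks in one agreed order, no cycle of waiters can
form and no detection code is needed. The sketch below is a generic
example only. DemoBuffer, lock_pair and unlock_pair are hypothetical
names, ordering by an integer id is an assumption made for the example,
and it is not a description of how the btree code orders its buffer
locks.

```c
#include <pthread.h>

/* Hypothetical buffer descriptor: the id defines the global lock order.
 * Assumption: distinct buffers always have distinct ids. */
typedef struct {
    int             id;
    pthread_mutex_t lock;
} DemoBuffer;

/* Lock two buffers, always taking the lower id first, so two callers
 * passing the same pair in opposite order cannot deadlock each other. */
void lock_pair(DemoBuffer *a, DemoBuffer *b)
{
    if (a == b) {                      /* same buffer: lock it once */
        pthread_mutex_lock(&a->lock);
        return;
    }
    if (a->id > b->id) {               /* normalize to ascending id */
        DemoBuffer *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->lock);
    pthread_mutex_lock(&b->lock);
}

/* Release order does not matter for deadlock avoidance. */
void unlock_pair(DemoBuffer *a, DemoBuffer *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}
```

A caller that breaks the agreed order is a bug to be fixed in the
caller, which is the position taken above.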

> Of course the major disadvantage is:
> * the lock setup/teardown overhead is much greater than for a
> spinlock, and the overhead is just wasted when there's no contention.

Exactly.

> A reasonable alternative would be to stick with the spinlock mechanism,
> but use a different locking routine (maybe call it S_SLOW_LOCK) that is
> designed to deal with locks that may be held for a long time. It would
> use much longer delay intervals than the regular S_LOCK code, and would
> have either a longer time till ultimate timeout, or no timeout at all.
> The main problem with this idea is choosing an appropriate timeout
> behavior. As I said, I am concerned about the possibility of deadlocks
> in WAL-recovery scenarios, so I am not very happy with the thought of
> no timeout at all. But it's hard to see what a reasonable timeout would

And I'm unhappy with timeouts -:) They are not a solution at all. We
should do the right design instead.

> be if a minute or more isn't enough in your test cases; seems to me that
> that suggests that for very large indexes, you might need a *long* time.
>
> Comments, preferences, better ideas?

For any spins which are held while doing IO ops we should have a queue of
waiting backends' PROCs. As I said - some kind of lightweight lock manager.
Just two kinds of locks - shared & exclusive. No structures to find locked
objects. No deadlock detection code. Backends should wait on their
semaphores, without timeouts.
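
Editor's sketch of the lightweight lock described above: shared and
exclusive modes, a FIFO queue of waiters, each waiter sleeping on its
own semaphore with no timeout, and no deadlock detection. It is an
illustration under stated assumptions, not code from the PostgreSQL
tree. All names (LWLock, Waiter, lw_acquire, lw_release) are
hypothetical, a pthread mutex stands in for the short spinlock that
guards the lock's own fields, and the waiter record lives on the
caller's stack instead of in a shared PROC structure. It assumes POSIX
unnamed semaphores.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LW_SHARED, LW_EXCLUSIVE } LWMode;

/* One queued waiter; stands in for a backend's PROC entry. */
typedef struct Waiter {
    sem_t          sem;     /* the waiter sleeps here, no timeout */
    LWMode         mode;
    struct Waiter *next;
} Waiter;

typedef struct {
    pthread_mutex_t mutex;      /* stand-in for a short "true" spinlock */
    int             shared;     /* number of shared holders */
    bool            exclusive;  /* true while held exclusively */
    Waiter         *head, *tail;/* FIFO queue of waiters */
} LWLock;

void lw_init(LWLock *l)
{
    pthread_mutex_init(&l->mutex, NULL);
    l->shared = 0;
    l->exclusive = false;
    l->head = l->tail = NULL;
}

/* Can a request in this mode be granted given the current holders? */
static bool lw_grantable(const LWLock *l, LWMode mode)
{
    if (l->exclusive)
        return false;
    return mode == LW_SHARED || l->shared == 0;
}

static void lw_grant(LWLock *l, LWMode mode)
{
    if (mode == LW_EXCLUSIVE)
        l->exclusive = true;
    else
        l->shared++;
}

void lw_acquire(LWLock *l, LWMode mode)
{
    Waiter self;

    pthread_mutex_lock(&l->mutex);
    /* Take the lock at once only if nobody is queued ahead of us;
     * this is what gives waiters an ordering. */
    if (l->head == NULL && lw_grantable(l, mode)) {
        lw_grant(l, mode);
        pthread_mutex_unlock(&l->mutex);
        return;
    }
    /* Otherwise join the tail of the queue and sleep. */
    sem_init(&self.sem, 0, 0);
    self.mode = mode;
    self.next = NULL;
    if (l->tail)
        l->tail->next = &self;
    else
        l->head = &self;
    l->tail = &self;
    pthread_mutex_unlock(&l->mutex);

    while (sem_wait(&self.sem) != 0)
        ;                       /* retry if interrupted by a signal */
    sem_destroy(&self.sem);     /* the releaser already granted the lock */
}

void lw_release(LWLock *l, LWMode mode)
{
    pthread_mutex_lock(&l->mutex);
    if (mode == LW_EXCLUSIVE) {
        assert(l->exclusive);
        l->exclusive = false;
    } else {
        assert(l->shared > 0);
        l->shared--;
    }
    /* Hand the lock to waiters in queue order: either one exclusive
     * waiter or a run of consecutive shared waiters. */
    while (l->head && lw_grantable(l, l->head->mode)) {
        Waiter *w = l->head;
        l->head = w->next;
        if (l->head == NULL)
            l->tail = NULL;
        lw_grant(l, w->mode);
        sem_post(&w->sem);      /* w must not be touched after this */
    }
    pthread_mutex_unlock(&l->mutex);
}
```

Because the releaser hands ownership directly to the head of the queue,
a newly arriving backend cannot grab a just-freed lock ahead of one that
has been waiting, and sleeping waiters use no CPU.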

For "true" spins (held for a really short time when accessing control
structures in shmem) we should not sleep 0.01 sec! tv_usec == 1 would be
reasonable - just to yield the CPU. Actually, mutexes would be much better...
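
Editor's sketch of the short-delay idea for "true" spins: on contention,
sleep through select() with tv_usec set to 1 so the process merely
gives up the CPU, then retry. This is an illustration, not the actual
S_LOCK code. The names spin_acquire, spin_release and yield_cpu are
hypothetical, and a C11 atomic_flag stands in for the platform's
test-and-set instruction. How long a 1 microsecond select() really
sleeps depends on the kernel's timer granularity.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Give up the CPU for (nominally) one microsecond. */
static void yield_cpu(void)
{
    struct timeval delay;

    delay.tv_sec = 0;
    delay.tv_usec = 1;
    (void) select(0, NULL, NULL, NULL, &delay);
}

/* Acquire a short-lived lock; returns how many times we had to yield. */
unsigned spin_acquire(atomic_flag *lock)
{
    unsigned yields = 0;

    /* test-and-set returns the previous value: true means "was held" */
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        yield_cpu();
        yields++;
    }
    return yields;
}

void spin_release(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

An uncontended acquire returns immediately without calling select(),
so the yield cost is paid only when another process holds the lock.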

Vadim
