Re: mosbench revisited

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 21:35:57
Lists: pgsql-hackers
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... We could possibly accept stale values for the
>> planner estimates, but I think heapam's number had better be accurate.

> I think the exact requirement is that, if the relation turns out to be
> larger than the size we read, the extra blocks had better not contain
> any tuples our snapshot can see.  There's actually no interlock
> between smgrnblocks() and smgrextend() right now, so presumably we
> don't need to add one.

No interlock in userspace, you mean.  We're relying on the kernel to do
it, ie, give us a number that is not older than the time of our (already
taken at this point) snapshot.

> I don't really think there's any sensible way to implement a
> per-backend cache, because that would require invalidation events of
> some kind to be sent on relation extension, and that seems utterly
> insane from a performance standpoint, even if we invented something
> less expensive than sinval.

Yeah, that's the issue.  But "relation extension" is not actually a
cheap operation, since it requires a minimum of one kernel call that is
presumably doing something nontrivial in the filesystem.  I'm not
entirely convinced that we couldn't make this work --- especially since
we could certainly derate the duty cycle by a factor of ten or more
without giving up anything remotely meaningful in planning accuracy.
(I'd be inclined to make it send an inval only once the relation size
had changed at least, say, 10%.)

> A shared cache seems like it could work, but the locking is tricky.
> Normally we'd just use a hash table protected by an LWLock, one
> LWLock per partition, but here that's clearly not going to work.  The
> kernel is using a spinlock per file, and that's still too
> heavy-weight.

That still seems utterly astonishing to me.  We're touching each of
those files once per query cycle; a cycle that contains two message
sends, who knows how many internal spinlock/lwlock/heavyweightlock
acquisitions inside Postgres (some of which *do* contend with each
other), and a not insignificant amount of plain old computing.
Meanwhile, this particular spinlock inside the kernel is protecting
what, a single doubleword fetch?  How is that the bottleneck?

I am wondering whether kernel spinlocks are broken.

			regards, tom lane

