Re: mosbench revisited

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 21:35:57
Message-ID: 23924.1312407357@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... We could possibly accept stale values for the
>> planner estimates, but I think heapam's number had better be accurate.

> I think the exact requirement is that, if the relation turns out to be
> larger than the size we read, the extra blocks had better not contain
> any tuples our snapshot can see. There's actually no interlock
> between smgrnblocks() and smgrextend() right now, so presumably we
> don't need to add one.

No interlock in userspace, you mean. We're relying on the kernel to do
it, ie, give us a number that is not older than the time of our (already
taken at this point) snapshot.
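
For context, the size probe in question ultimately boils down to an lseek
to end-of-file; a minimal sketch (illustrative names only, not the actual
md.c code) looks like this:

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Illustrative sketch: ask the kernel for the current file length and
 * convert it to blocks.  We rely on the kernel to serialize this against
 * concurrent writes, so the answer is no older than the moment of the call.
 */
long
sketch_nblocks(int fd)
{
	off_t		len = lseek(fd, 0, SEEK_END);

	if (len < 0)
		return -1;				/* real code would ereport() instead */
	return (long) (len / BLCKSZ);
}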

> I don't really think there's any sensible way to implement a
> per-backend cache, because that would require invalidation events of
> some kind to be sent on relation extension, and that seems utterly
> insane from a performance standpoint, even if we invented something
> less expensive than sinval.

Yeah, that's the issue. But "relation extension" is not actually a
cheap operation, since it requires a minimum of one kernel call that is
presumably doing something nontrivial in the filesystem. I'm not
entirely convinced that we couldn't make this work --- especially since
we could certainly derate the duty cycle by a factor of ten or more
without giving up anything remotely meaningful in planning accuracy.
(I'd be inclined to make it send an inval only once the relation size
had changed by at least, say, 10%.)
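
A sketch of that derating idea, with entirely hypothetical names (none of
this is existing Postgres code), might look like:

/*
 * Hypothetical: on relation extension, send a size-change inval only when
 * the relation has grown by at least 10% past the last size we advertised.
 */
typedef struct RelSizeCacheEntry
{
	unsigned	last_reported_nblocks;	/* size last broadcast to backends */
} RelSizeCacheEntry;

void
maybe_send_size_inval(RelSizeCacheEntry *entry, unsigned new_nblocks)
{
	unsigned	reported = entry->last_reported_nblocks;

	/* skip the inval unless we've grown at least 10% past the last report */
	if (new_nblocks < reported + reported / 10 + 1)
		return;

	entry->last_reported_nblocks = new_nblocks;
	/* ... queue the (hypothetical) size-change invalidation message here ... */
}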

> A shared cache seems like it could work, but the locking is tricky.
> Normally we'd just use a hash table protected by an LWLock, one
> LWLock per partition, but here that's clearly not going to work. The
> kernel is using a spinlock per file, and that's still too
> heavy-weight.

That still seems utterly astonishing to me. We're touching each of
those files once per query cycle; a cycle that contains two message
sends, who knows how many internal spinlock/lwlock/heavyweightlock
acquisitions inside Postgres (some of which *do* contend with each
other), and a not insignificant amount of plain old computing.
Meanwhile, this particular spinlock inside the kernel is protecting
what, a single doubleword fetch? How is that the bottleneck?

I am wondering whether kernel spinlocks are broken.
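
For concreteness, the partitioned shared cache Robert describes might be
laid out roughly as below.  This is purely illustrative: the partition
count, entry layout, and pthread mutexes standing in for LWLocks are all
assumptions, and it ignores how the cache would be kept coherent with
relation extension.

#include <pthread.h>
#include <stdint.h>

#define NUM_PARTITIONS	16				/* must be a power of two */
#define NUM_BUCKETS		64

typedef struct RelSizeEntry
{
	uint32_t	relid;
	uint32_t	nblocks;
	struct RelSizeEntry *next;
} RelSizeEntry;

typedef struct RelSizePartition
{
	pthread_mutex_t lock;				/* stands in for an LWLock */
	RelSizeEntry *buckets[NUM_BUCKETS];
} RelSizePartition;

static RelSizePartition partitions[NUM_PARTITIONS];

void
relsize_cache_init(void)
{
	for (int i = 0; i < NUM_PARTITIONS; i++)
		pthread_mutex_init(&partitions[i].lock, NULL);
}

/* Look up a cached relation size; returns -1 if not cached. */
long
relsize_cache_lookup(uint32_t relid)
{
	RelSizePartition *p = &partitions[relid & (NUM_PARTITIONS - 1)];
	long		result = -1;

	pthread_mutex_lock(&p->lock);
	for (RelSizeEntry *e = p->buckets[relid % NUM_BUCKETS]; e != NULL; e = e->next)
	{
		if (e->relid == relid)
		{
			result = (long) e->nblocks;
			break;
		}
	}
	pthread_mutex_unlock(&p->lock);
	return result;
}

The point of the partitioning is only that lookups for different relations
usually hash to different locks; as Robert notes, even that may be too
heavy if each miss still ends in a contended per-file lock inside the
kernel.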

regards, tom lane
