On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> (If the query ended up being a seqscan, I'd expect a second
>>> lseek(SEEK_END) when the executor starts up, but I gather from the other
>>> complaints that the mosbench people were only testing simple indexscan
>>> queries.)
>> Yeah, it seems that for a sequential scan we lseek the heap, then the
>> index, then the heap again; but for index scans we just hit the heap
>> and the index.
> Sure. The first two come from the planner getting the table and index
> sizes for estimation purposes (look in plancat.c). The last is done in
> heapam.c's initscan(). We could possibly accept stale values for the
> planner estimates, but I think heapam's number had better be accurate.

I think the exact requirement is that, if the relation turns out to be
larger than the size we read, the extra blocks had better not contain
any tuples our snapshot can see. There's actually no interlock
between smgrnblocks() and smgrextend() right now, so presumably we
don't need to add one. However, a value cached from a few seconds ago
is clearly not going to cut it.

I don't really think there's any sensible way to implement a
per-backend cache, because that would require invalidation events of
some kind to be sent on relation extension, and that seems utterly
insane from a performance standpoint, even if we invented something
less expensive than sinval. I guess it might work for planning
purposes if you only sent out invalidation events on every N'th
extension or something, but penalizing the accuracy of planning to
work around a Linux kernel bug that only manifests itself on machines
with >32 cores doesn't seem very appealing.

A shared cache seems like it could work, but the locking is tricky.
Normally we'd just use a hash table protected by LWLocks, one
LWLock per partition, but here that's clearly not going to work. The
kernel is using a spinlock per file, and that's still too
heavy-weight. I think that if we could prepopulate the cache with all
the keys (i.e. relfilenodes) we care about and then never add or evict
any, we could run it completely unlocked. I believe that the existing
memory barriers in things like LWLockAcquire() would be sufficient to
prevent us from reading a too-old value (i.e. block count). In
particular, you couldn't read a value that predated your snapshot, if
you got your snapshot by holding ProcArrayLock. But the races
involved with adding and removing items from the cache are hard to
deal with without using locks, especially because the keys are 12
bytes or more and therefore can't be read or written atomically.

I've been mulling over how we might deal with this and actually coded
up an implementation, but it turns out (surprise, surprise) to have
problems with insufficient locking. So I'm thinking it over some
more. And hoping that the Linux guys decide to do something about it.

This isn't really our bug - lseek is quite cheap in the uncontended case.