On 09.01.2012 15:44, Simon Riggs wrote:
> On Sat, Jan 7, 2012 at 9:31 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Anyway, here's a new version of the patch. It no longer busy-waits for
>> in-progress insertions to finish, and handles xlog-switches. This is now
>> feature-complete. It's a pretty complicated patch, so I would appreciate
>> more eyeballs on it. And benchmarking again.
> Took me awhile to understand why the data structure for the insertion
> slots is so complex. Why not have slots per buffer? That would be
> easier to understand and slots are very small.
Hmm, how would that work?
> Can we avoid having spinlocks on the slots altogether? If we have a
> page number (int) and an LSN, inserters would set LSN and then set
> page number. Anybody waiting for slots would stop if page number is
> zero since that means its not complete yet. So readers look at page
> number first and aren't allowed to look at LSN without valid page
The LSN on a slot is set in ReserveXLogInsertLocation(), while holding
the insertpos_lck spinlock. The inserter doesn't acquire the per-slot
spinlock at that point, it relies on the fact that no-one will look at
the slot until the shared nextslot variable has been incremented. The
spinlock is only acquired when updating the pointer, which only happens
when crossing a WAL page, which isn't that performance-criticial, and
when the insertion is finished. It would be nice to get rid of the
spinlock acquisition when the insertion is finished, but I don't see any
easy way around that. The spinlock is needed to make sure that when the
inserter clears its slot, it can atomically check the waiter field.
The theory is that contention on those per-slot spinlocks is very rare.
Profiling this patch with "perf", it looks like the bottleneck is the
> Page number would be useful in working out where to stop when doing
> background flushes, which we need for Group Commit, which is arriving
> soon for this release.
Ok, I'll have to just take your word on that :-). I don't see why Group
Commit needs to care about page boundaries, but the slot data structure
I used already allows you to check fairly how far you write out the WAL
without having to wait for any in-progress insertions to complete.
> Can we also try aligning the actual insertions onto cache lines rather
> than just MAXALIGNing them? The WAL header fills half a cache line as
> it is, so many other records will fit nicely. I'd like to see what
> that does to space consumption, but it might be a useful option at
Hmm, that's an interesting thought. That would mean having gaps in the
in-memory WAL cache, so that when it's written out, you'd need to stitch
together the pieces to form the WAL that's actually written to disk. Or
just leave the gaps in the on-disk format, if we're willing to change
the WAL format for this, but I don't think we want make our WAL any
larger than it already is.
I've written this patch avoiding WAL format changes, but if we're
willing to do that, there's a few things we could do that would help.
For one, the logic in ReserveXLogInsertLocation() that figures out where
in the WAL stream the record begins and where it ends could be made a
lot simpler. At the moment, we refuse to split a WAL record header
across WAL pages, and because of that, the number of bytes occupied by a
WAL record depends on where in the WAL it's written. If we changed that,
reserving space from the WAL for a record that's N bytes long could be
done essentially as "CurrPos += N". There's some complications, like
having to keep track of the prev-link too, but I believe it would be
possible to get rid of the spinlock and implement
ReserveXLogInsertLocation() as a single atomic fetch-and-add instruction.
> GetInsertRecPtr() should return the XLogRecPtr of the latest
> allocation. IMHO that is what we need for checkpoints and the
> walsender doesn't really matter.
Ok. Thanks looking at the patch!
In response to
pgsql-hackers by date
|Next:||From: Andrew Dunstan||Date: 2012-01-09 15:35:06|
|Subject: Re: pgsql: plpython: Add SPI cursor support|
|Previous:||From: Ryan Kelly||Date: 2012-01-09 13:49:25|
|Subject: Re: [PATCH] Allow breaking out of hung connection