Re: XLogInsert scaling, revisited

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XLogInsert scaling, revisited
Date: 2013-06-26 15:52:30
Message-ID: 51CB0E3E.5060605@vmware.com
Lists: pgsql-hackers

On 24.06.2013 21:01, Andres Freund wrote:
> Ok, I started to look at this:

Thanks!

> * Could you document the way slots prevent checkpoints from occurring
> when XLogInsert rechecks for full page writes? I think it's correct -
> but not very obvious on a glance.

There's this in the comment near the top of the file:

* To update RedoRecPtr or fullPageWrites, one has to make sure that all
* subsequent inserters see the new value. This is done by reserving all the
* insertion slots before changing the value. XLogInsert reads RedoRecPtr and
* fullPageWrites after grabbing a slot, so by holding all the slots
* simultaneously, you can ensure that all subsequent inserts see the new
* value. Those fields change very seldom, so we prefer to be fast and
* non-contended when they need to be read, and slow when they're changed.

Does that explain it well enough? XLogInsert holds onto a slot while it
rechecks for full page writes.
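To illustrate the ordering (a condensed sketch, not the literal patch
code; the acquire function's signature and the recheck details are
simplified here):

    /* Condensed sketch of the relevant part of XLogInsert. */
    WALInsertSlotAcquire(false);    /* grab one insertion slot (shared) */

    /*
     * A checkpoint that wants to change RedoRecPtr or fullPageWrites
     * must first acquire *all* slots exclusively, so it blocks until
     * we release ours. Hence these reads cannot race with an update:
     */
    if (RedoRecPtr != Insert->RedoRecPtr)
    {
        /* Our full-page-write decisions were made against a stale
         * RedoRecPtr: update the local copy and redo the checks. */
        RedoRecPtr = Insert->RedoRecPtr;
    }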

> * The read of Insert->RedoRecPtr while rechecking whether it's out of
> date now is unlocked, is that correct? And why?

Same here: XLogInsert holds the slot while it rechecks Insert->RedoRecPtr.

> * Have you measured whether it works to just keep as many slots as
> max_backends requires around and not bothering with dynamically
> allocating them to inserters?
> That seems to require to keep actually waiting slots in a separate
> list which very well might make that too expensive.
>
> Correctness-wise, the biggest problem for that is probably the exclusive
> acquisition of all slots that CreateCheckpoint() does...

Hmm. It wouldn't be much different: each backend would still need to
reserve its own dedicated slot, because the slot might be held by a
backend that grabbed all the slots. Also, bgwriter and checkpointer need
to flush the WAL, so they'd need slots too.

More slots make WaitXLogInsertionsToFinish() more expensive, as it has
to loop through all of them. IIRC some earlier pgbench tests I ran
didn't show much difference in performance, whether there were 40 slots
or 100, as long as there were enough of them. I can run some more tests
with even more slots to see how it behaves.
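
For context on where the cost comes from: the function has to scan every
slot, so its work grows linearly with the slot count. Schematically
(simplified; variable names approximate):

    /* Schematic of WaitXLogInsertionsToFinish(), simplified. */
    XLogRecPtr  finishedUpto = upto;
    int         i;

    for (i = 0; i < NumXLogInsertSlots; i++)
    {
        /* read under the slot's lock in the real code */
        XLogRecPtr  insertingat = XLogInsertSlots[i].xlogInsertingAt;

        if (insertingat != 0 && insertingat < finishedUpto)
            finishedUpto = insertingat;  /* insertion before 'upto' still in progress */
    }
    /* ...then sleep on, and re-scan, any slots still inserting before 'upto' */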

> * What about using some sort of linked list of unused slots for
> WALInsertSlotAcquireOne? Every time we're done we put it to the *end*
> of the list so it's less likely to have been grabbed by somebody else
> so we can reuse it.
> a) To grab a new one go to the head of the linked list spinlock it and
> recheck whether it's still free. If not, restart. Otherwise, remove
> from list.
> b) To reuse a previously used slot
>
> That way we only have to do the PGSemaphoreLock() dance if there
> really aren't any free slots.

That adds a spinlock acquisition / release into the critical path, to
protect the linked list. I'm trying very hard to avoid serialization
points like that.
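
To spell out what I mean by a serialization point: every slot
acquisition would have to funnel through one lock-protected list head,
along these lines (a hypothetical sketch of the proposal, not code from
the patch):

    /* Hypothetical free-list pop -- the single point of contention: */
    SpinLockAcquire(&freelist->mutex);  /* every inserter serializes here */
    slot = freelist->head;
    if (slot != NULL)
        freelist->head = slot->nextFree;
    SpinLockRelease(&freelist->mutex);

    if (slot == NULL)
    {
        /* only now fall back to the PGSemaphoreLock() dance */
    }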

In the profiles from the tests I've done, finding a free slot hasn't
shown up as significant. So I don't think it's a performance problem as
it is, and I don't think the free list would make the code any simpler.

> * The queuing logic around slots seems to lack documentation. It's
> complex enough to warrant that imo.

Yeah, it's complex. I expanded the comments above XLogInsertSlot to
explain how xlogInsertingAt works. Did that help, or is there some other
part that needs more docs?

> * Not a big fan of the "holdingAll" variable name, for a file-global one
> that seems a bit too generic.

Renamed to holdingAllSlots.

> * Could you add a #define or comment for the 64 used in
> XLogInsertSlotPadded? Not everyone might recognize that immediately as
> the most common cacheline size.
> Commenting on the reason we pad in general would be a good idea as
> well.

Copy-pasted and edited the explanation from LWLockPadded for that. I
also changed the way the array is allocated so that it's aligned on a
64-byte boundary, like we do for lwlocks.
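
For reference, the trick is the same as with LWLockPadded: pad each slot
to a full cache line so that no two slots share one, avoiding false
sharing when different backends update adjacent slots. Roughly (the
constant name here is illustrative):

    #define XLOGINSERT_SLOT_PAD_SIZE  64    /* most common cache line size */

    /* Pad a slot to a whole cache line, so that two slots never share
     * one and updates to a slot don't ping-pong its neighbors' lines. */
    typedef union XLogInsertSlotPadded
    {
        XLogInsertSlot  slot;
        char            pad[XLOGINSERT_SLOT_PAD_SIZE];
    } XLogInsertSlotPadded;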

> * Is it correct that WALInsertSlotAcquireOne() resets xlogInsertingAt of
> all slots *before* it has acquired locks in all of them? If yes, why
> haven't we already reset it to something invalid?

I didn't understand this part. Can you elaborate?

> * Is GetXLogBuffer()'s unlocked read correct without a read barrier?

Hmm. I thought that it was safe because GetXLogBuffer() handles the case
that you get a "torn read" of the 64-bit XLogRecPtr value. But now that
I think of it, I wonder if it's possible for reads/writes to be
reordered so that AdvanceXLInsertBuffer() overwrites WAL data that
another backend has copied onto a page. Something like this:

1. Backend B zeroes a new WAL page, and sets its address in xlblocks
2. Backend A calls GetXLogBuffer(), and gets a pointer to that page
3. Backend A copies the WAL data to the page.

There is no lock acquisition in backend A during those steps, so I think
the writes from step 3 might in theory be reordered to happen before
step 1, in which case step 1 would overwrite the WAL data written in
step 3.
It sounds crazy, but after reading README.barrier, I don't see anything
that guarantees it won't happen in the weaker memory models.

To be safe, I'll add a full memory barrier to GetXLogBuffer(), and rerun
the benchmarks.
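
Concretely, the plan is something like this in GetXLogBuffer(), right
after the unlocked check succeeds (sketch with the details simplified;
pg_memory_barrier() is the full barrier described in README.barrier):

    if (XLogCtl->xlblocks[idx] == expectedEndPtr)   /* unlocked read */
    {
        /*
         * Full barrier: prevents our upcoming WAL writes into the page
         * from being reordered before this read, i.e. before we have
         * observed that the page was initialized (step 1 above).
         */
        pg_memory_barrier();
        return page;        /* pointer into the WAL buffer page */
    }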

> * XLogBytePosToEndRecPtr() seems to have a confusing name to me. At
> least the comment needs to better explain that it's named that way
> because of the way it's used.

Ok, added a sentence on that. Let me know if that helped or if you have
better suggestions.

> Also, doesn't
> seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD;
> potentially point into the middle of a page?

Yes. seg_offset is the byte offset from the beginning of the segment;
it's supposed to point into the middle of the page.
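
A worked example, assuming XLOG_BLCKSZ = 8192 and a short page header of
24 bytes (typical values):

    usable bytes per continuation page = 8192 - 24 = 8168
    bytesleft = 2 * 8168 + 100   =>  fullpages = 2, bytesleft = 100
    seg_offset += 2 * 8192 + 100 + 24

So we skip two whole pages and land 124 bytes into the next one: past
its 24-byte header and 100 usable bytes of record data, squarely in the
middle of a page, as intended.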

> * I wish we could get rid of the bytepos notion, it - while rather
> clever - complicates things. But that's probably not going to happen
> unless we get rid of short/long page headers :/

Yeah. Fortunately its use is quite isolated.

I've attached a new version of the patch, with some additional comments
as mentioned in the above paragraphs. And the memory barrier.

- Heikki

Attachment Content-Type Size
xloginsert-scale-25.patch.gz application/x-gzip 27.2 KB
