Re: New WAL record to detect the checkpoint redo location

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New WAL record to detect the checkpoint redo location
Date: 2023-10-09 20:47:02
Message-ID: 20231009204702.cy3ilp6wgh7a5qms@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

As noted in my email from a few minutes ago, I agree that optimizing this
shouldn't be a requirement for merging the patch.

On 2023-10-09 15:58:36 -0400, Robert Haas wrote:
> 1. The reason why we're doing this multiplication and division is to
> make sure that the code in ReserveXLogInsertLocation which executes
> while holding insertpos_lck remains as simple and brief as possible.
> We could eliminate the conversion between usable byte positions and
> LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had
> ReserveXLogInsertLocation work out by how much to advance the LSN, but
> it would have to be worked out while holding insertpos_lck (or some
> replacement lwlock, perhaps) and that cure seems worse than the
> disease. Given that, I think we're stuck with converting between
> usable byte positions and LSNs, and that intrinsically needs some
> multiplication and division.

Right, that's absolutely crucial for scalability.

> 2. It seems possible to remove one branch in each of
> XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing
> whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply
> increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then
> the rest of the calculations can be performed as if every page in the
> segment had a header of length SizeOfXLogShortPHD, with no need to
> special-case the first page. However, that doesn't get rid of any
> multiplication or division, just a branch.

This reminded me of something that has bugged me for a while: the whole
idea of short xlog page headers seems like a premature optimization. The page
header is a very small fraction of the overall data (long: 40/8192 ~= 0.00488,
short: 24/8192 ~= 0.00292); compared to the space we waste in many other
places, including at the per-record level, the saving doesn't seem worth the
complexity.

> 3. Aside from that, there seems to be no simple way to reduce the
> complexity of an individual calculation, but ReserveXLogInsertLocation
> does perform 3 rather similar computations, and I believe that we know
> that it will always be the case that *PrevPtr < *StartPos < *EndPos.
> Maybe we could have a fast-path for the case where they are all in the
> same segment. We could take prevbytepos modulo UsableBytesInSegment;
> call the result prevsegoff. If UsableBytesInSegment - prevsegoff >
> endbytepos - prevbytepos, then all three pointers are in the same
> segment, and maybe we could take advantage of that to avoid performing
> the segment calculations more than once, but still needing to repeat
> the page calculations. Or, instead or in addition, I think we could by
> a similar technique check whether all three pointers are on the same
> page; if so, then *StartPos and *EndPos can be computed from *PrevPtr
> by just adding the difference between the corresponding byte
> positions.

I think we might be able to speed some of this up by pre-computing values so we
can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
already insist on power-of-two segment sizes, so instead of needing to divide
by a runtime value, we should be able to shift by a runtime value (and compute
the modulo with a mask).

> I'm not really sure whether that would come out cheaper. It's just the
> only idea that I have. It did also occur to me to wonder whether the
> apparent delays performing multiplication and division here were
> really the result of the arithmetic itself being slow or whether they
> were synchronization-related, SpinLockRelease(&Insert->insertpos_lck)
> being a memory barrier just before. But I assume you thought about
> that and concluded that wasn't the issue here.

I did verify that they continue to be a bottleneck even after removing the
spinlock (obviously incorrect, but useful for measurement). It's also not too
surprising: the latency of 64-bit divisions is just high, particularly on
Intel CPUs from a few years ago (my Cascade Lake workstation), and IIRC
there's just a single execution port for them, so multiple division
instructions can't be fully parallelized.

https://uops.info/table.html documents a worst-case latency of 89 cycles on
Cascade Lake, with the division broken up into 36 uops (reducing what's
available to track other in-flight instructions). It's much better on Alder
Lake (9 cycles and 7 uops on the performance cores, 44 cycles and 4 uops on
the efficiency cores) and on Zen 3+ (19 cycles, 2 uops).

Greetings,

Andres Freund
