| From: | Lukas Fittl <lukas(at)fittl(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | John Naylor <johncnaylorls(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com> |
| Subject: | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |
| Date: | 2026-04-02 23:16:13 |
| Message-ID: | CAP53PkwNwbEKj4=qZ+szimpAQYToNB2iqXSV2p99J6ZoiTRarQ@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Andres,
On Wed, Apr 1, 2026 at 5:54 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> Lukas, could you add some preliminary Reviewed-by tags to the
> not-yet-committed commits?
>
Sure - I've gone back through the thread and added preliminary review
tags for everyone who tested the patch or provided code feedback - my
advance apologies if anyone was left out. I've been liberal in adding
"(in an earlier version)" since I don't want to incorrectly claim that
someone reviewed the latest variant.
Also, I wasn't sure how you wanted to handle the mix of authorship
and reviews the three of us (Andres, David, Lukas) have contributed
over the lifetime of this thread, so I tried to list authors in the
order that felt right, and also listed ourselves as reviewers. For
example, David and you originally proposed the ticks scaling logic, so
I put you two first as authors on the commit that introduces
pg_ticks_to_ns (0001). For the TSC patch I've written most of the
current lines of code, including the GUC handling, so I put myself
first. FWIW, I'm certainly fine with reordering my name on any of the
commits if that doesn't seem accurate, since I think we all just want
to see this in Postgres one way or another.
> > +/*
> > + * Stores what the number of ticks needs to be multiplied with to end up
> > + * with nanoseconds using integer math.
> > + *
> > + * On certain platforms (currently Windows) the ticks to nanoseconds conversion
> > + * requires floating point math because:
> > + *
> > + * sec = ticks / frequency_hz
> > + * ns = ticks / frequency_hz * 1,000,000,000
> > + * ns = ticks * (1,000,000,000 / frequency_hz)
> > + * ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> > + *
> > + * Here, 'ns' is usually a floating number. For example for a 2.5 GHz CPU
> > + * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
> > + *
> > + * To be able to use integer math we work around the lack of precision. We
> > + * first scale the integer up (left shift by TICKS_TO_NS_SHIFT) and after the
> > + * multiplication by the number of ticks in pg_ticks_to_ns() we shift right by
> > + * the same amount. We utilize unsigned integers even though ticks are stored
> > + * as a signed value to encourage compilers to generate better assembly.
> > + *
> > + * We remember the maximum number of ticks that can be multiplied by the scale
> > + * factor without overflowing so we can check via a * b > max <=> a > max / b.
> > + *
> > + * On all other platforms we are using clock_gettime(), which uses nanoseconds
> > + * as ticks. Hence, we set the multiplier to zero, which causes pg_ticks_to_ns
> > + * to return the original value.
> > + */
> > +uint64 ticks_per_ns_scaled = 0;
> > +uint64 max_ticks_no_overflow = 0;
> > +bool timing_initialized = false;
>
> Think it may be worth adding an example for a realistic max_ticks_no_overflow
> to show that we almost always are going to be able to avoid the overflow case.
I added a clarifying comment:
* ...
* We remember the maximum number of ticks that can be multiplied by the scale
* factor without overflowing so we can check via a * b > max <=> a > max / b.
*
* However, as this is meant for interval measurements, it is unlikely that the
* overflow path is actually taken in typical scenarios, since overflows would
* only occur for intervals longer than 6.5 days.
*...
Also, FWIW, earlier in that same comment the math was wrong (I think),
so I corrected the example to 1,000,000 / 2,500,000 = 0.4 (it
previously said 1.2).
> > +static inline int64
> > +pg_ticks_to_ns(int64 ticks)
> > {
> > - LARGE_INTEGER f;
> > +#if PG_INSTR_TICKS_TO_NS
> > + int64 ns = 0;
> > +
> > + Assert(timing_initialized);
> > +
> > + /*
> > + * Avoid doing work if we don't use scaled ticks, e.g. system clock on
> > + * Unix
> > + */
> > + if (ticks_per_ns_scaled == 0)
> > + return ticks;
> > +
> > + /*
> > + * Would multiplication overflow? If so perform computation in two parts.
> > + */
> > + if (unlikely(ticks > (int64) max_ticks_no_overflow))
> > + {
> > + /*
> > + * To avoid overflow, first scale total ticks down by the fixed
> > + * factor, and *afterwards* multiply them by the frequency-based scale
> > + * factor.
> > + *
> > + * The remaining ticks can follow the regular formula, since they
> > + * won't overflow.
> > + */
> > + int64 count = ticks >> TICKS_TO_NS_SHIFT;
> > +
> > + ns = count * ticks_per_ns_scaled;
> > + ticks -= (count << TICKS_TO_NS_SHIFT);
> > + }
> > +
> > + ns += (ticks * ticks_per_ns_scaled) >> TICKS_TO_NS_SHIFT;
> > +
> > + return ns;
> > +#else
> > + Assert(timing_initialized);
> >
> > - QueryPerformanceFrequency(&f);
> > - return (double) f.QuadPart;
> > + return ticks;
> > +#endif /* PG_INSTR_TICKS_TO_NS */
>
> Kinda wondering what the best way would be to verify that the overflow case
> actually works. What about adding a regress.c test that checks that we get
> the right results for something like
>
> INSTR_TIME_SET_ZERO(t);
> INSTR_TIME_ADD_NANOSEC(t, some_number);
> EXPECT_EQ_U64(INSTR_TIME_GET_NANOSEC(t), some_number);
>
> for a few different some_numbers?
>
Yeah, that's a good idea. I added a test similar to what you proposed,
with some extra logic to handle the imprecision introduced by the
shifting. I also confirmed that the test correctly fails when removing
a part of the overflow logic.
>
> > Subject: [PATCH v14 4/6] instrumentation: Use Time-Stamp Counter (TSC) on
> > x86-64 for faster measurements
>
> > +bool
> > +check_timing_clock_source(int *newval, void **extra, GucSource source)
> > +{
> > + /*
> > + * Ensure timing is initialized. On Windows (EXEC_BACKEND), GUC hooks can
> > + * be called during InitializeGUCOptions() before InitProcessGlobals() has
> > + * had a chance to run pg_initialize_timing().
> > + */
> > + pg_initialize_timing();
>
> So doesn't that mean the effort to synchronize the tsc freq on EXEC_BACKEND is futile?
Good catch. I think the best way out of that is to not initialize the
TSC in the GUC hooks on EXEC_BACKEND, and instead call
pg_set_timing_clock_source in restore_backend_variables, which
initializes the TSC if it isn't already initialized.
Done that way for now, with an early return in both the assign and
check hooks when EXEC_BACKEND is defined and timing is not yet
initialized.
>
> > +bool
> > +pg_set_timing_clock_source(TimingClockSourceType source)
> > +{
> > + Assert(timing_initialized);
> > +
> > +#if PG_INSTR_TSC_CLOCK
> > + pg_initialize_timing_tsc();
> > +#endif
> > +
> > +#if PG_INSTR_TSC_CLOCK
> > + switch (source)
> > + {
> > + case TIMING_CLOCK_SOURCE_AUTO:
> > + use_tsc = (tsc_frequency_khz > 0) && tsc_use_by_default();
>
> A global named use_tsc seems a tad too short a name for my personal taste. Too
> likely to also be used by something else.
Renamed this to "timing_tsc_enabled", and also renamed the frequency
variable to timing_tsc_frequency_khz for consistency.
>
> > +int32 tsc_frequency_khz = -1;
> > +
> > +static uint32 tsc_calibrate(void);
> > +
> > +/*
> > + * Detect the TSC frequency and whether RDTSCP is available on x86-64.
> > + *
> > + * This can't be reliably determined at compile time, since the
> > + * availability of an "invariant" TSC (that is not affected by CPU
> > + * frequency changes) is dependent on the CPU architecture. Additionally,
> > + * there are cases where TSC availability is impacted by virtualization,
> > + * where a simple cpuid feature check would not be enough.
> > + */
> > +static void
> > +tsc_detect_frequency(void)
> > +{
> > + tsc_frequency_khz = 0;
>
> So we're using -1 for uninitialized and 0 for unknown? If so we should
> document that.
That's documented in instr_time.h:
/*
* TSC frequency in kHz, set during initialization.
*
* -1 = not yet initialized, 0 = TSC not usable, >0 = frequency in kHz.
*/
extern PGDLLIMPORT int32 timing_tsc_frequency_khz;
Let me know if you think there is a better location.
> > + /* We require RDTSCP support, bail if not available */
> > + if (!x86_feature_available(PG_RDTSCP))
> > + return;
> > +
> > + /* Determine speed at which the TSC advances */
> > + tsc_frequency_khz = x86_tsc_frequency_khz();
> > + if (tsc_frequency_khz > 0)
> > + return;
> > +
> > + /*
> > + * CPUID did not give us the TSC frequency. If TSC is invariant and RDTSCP
> > + * is available, we can measure the frequency by comparing TSC ticks
> > + * against walltime using a short calibration loop.
> > + */
> > + if (x86_feature_available(PG_TSC_INVARIANT))
> > + tsc_frequency_khz = tsc_calibrate();
> > +}
>
> Why do we not need to always check if the TSC is invariant? Seems like a hard
> requirement to me?
Yeah, that's a good point - moved this up to check it together with
RDTSCP support.
> > +static uint32
> > +tsc_calibrate(void)
> > +{
> > + instr_time initial_wall;
> > + int64 initial_tsc;
> > + double freq_khz = 0;
> > + double prev_freq_khz = 0;
> > + int stable_count = 0;
> > + int64 prev_tsc;
> > + uint32 unused;
> > +
> > + /* Ensure INSTR_* time below work on system time */
> > + set_ticks_per_ns_system();
> > +
> > + INSTR_TIME_SET_CURRENT(initial_wall);
> > +
> > +#ifdef _MSC_VER
> > + initial_tsc = __rdtscp(&unused);
> > +#else
> > + initial_tsc = __builtin_ia32_rdtscp(&unused);
> > +#endif
> > + prev_tsc = initial_tsc;
> > +
> > + for (int i = 0; i < TSC_CALIBRATION_ITERATIONS; i++)
> > + {
> > + instr_time now_wall;
> > + int64 now_tsc;
> > + int64 elapsed_ns;
> > + int64 elapsed_ticks;
> > +
> > + INSTR_TIME_SET_CURRENT(now_wall);
> > +
> > +#ifdef _MSC_VER
> > + now_tsc = __rdtscp(&unused);
> > +#else
> > + now_tsc = __builtin_ia32_rdtscp(&unused);
> > +#endif
>
> I'd put this in a helper, given that we already need it twice, and we might
> later need to extend it for other architectures.
Yeah, that's fair. Since we have the same pattern in instr_time.h I
added both "pg_rdtsc" and "pg_rdtscp" as inline helpers.
I assume we want to avoid the extra include, otherwise we could also
put these in pg_cpu.h.
>
> > + /*
> > + * Skip if this is not the Nth cycle where we measure, if TSC hasn't
> > + * advanced, or we walked backwards for some reason.
> > + */
> > + if (i % TSC_CALIBRATION_SKIPS != 0 || now_tsc == prev_tsc || elapsed_ns <= 0 || elapsed_ticks <= 0)
> > + continue;
>
> What's the goal of TSC_CALIBRATION_SKIPS?
This improved the measurement quality in my testing, by measuring the
time elapsed across 100 RDTSCP instructions instead of a single
instruction. Without this we are biased towards stabilizing on a later
measurement based on just a handful of RDTSCP instructions, since
we're doing
freq_khz = ((double) elapsed_ticks / elapsed_ns) * 1000 * 1000;
i.e. the 1000th and the 1001st measurement differ only by a very small
number of ticks and ns, causing us to quickly confirm subsequent
measurements and arrive at an inaccurate TSC frequency.
> > + freq_khz = ((double) elapsed_ticks / elapsed_ns) * 1000 * 1000;
> > +
> > + /*
> > + * Once freq_khz / prev_freq_khz is small, check if it stays that way.
> > + * If it does for long enough, we've got a winner frequency.
> > + */
> > + if (prev_freq_khz != 0 && fabs(1 - freq_khz / prev_freq_khz) < 0.0001)
> > + {
> > + stable_count++;
> > + if (stable_count >= TSC_CALIBRATION_STABLE_CYCLES)
> > + return (uint32) freq_khz;
> > + }
> > + else
> > + stable_count = 0;
> > +
> > + prev_tsc = now_tsc;
> > + prev_freq_khz = freq_khz;
> > + }
> > +
> > + /* did not converge */
> > + return 0;
> > +}
>
> I think it may be worth teaching pg_test_timing to display the tsc frequency
> and show the frequency determined by calibration even if we got it via
> cpuid. Otherwise it'll be hard to debug this.
Displaying the TSC frequency is already done in the later patch that
adds extra RDTSC(P) reporting in pg_test_timing.
However, that later change only reports the frequency stored in the
global, and doesn't call into the TSC calibration logic if we got the
frequency via CPUID. We could add that if you think it's important,
but it would require exporting tsc_calibrate in instr_time.h.
> > +/*
> > + * Initialize the TSC clock source by determining its usability and frequency.
> > + *
> > + * This can be called multiple times, as tsc_frequency_khz will be set to 0
> > + * if a prior call determined the TSC is not usable. On EXEC_BACKEND (Windows),
> > + * the TSC frequency may also be set by restore_backend_variables.
> > + */
> > +void
> > +pg_initialize_timing_tsc(void)
> > +{
> > + if (tsc_frequency_khz < 0)
> > + tsc_detect_frequency();
> > +}
>
> Maybe we should add a debug log with the frequency in the < 0 case?
>
Makes sense, added, with a !FRONTEND check.
> > From 44292c17075264c659e03d784a6c7619e4c5bd3e Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <lukas(at)fittl(dot)com>
> > Date: Tue, 10 Mar 2026 01:38:14 -0700
> > Subject: [PATCH v14 6/6] instrumentation: ARM support for fast time
> > measurements
> >
> > Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
> > use of the cntvct_el0 instruction on ARM systems to access the generic
> > timer that provides a synchronized ticks value across CPUs.
> >
> > Note this adds an exception for Apple Silicon CPUs, due to the observed
> > fact that M3 and newer has different timer frequencies for the Efficiency
> > and the Performance cores, and we can't be sure where we get scheduled.
>
> There are other such SOCs with different clock speeds for
> performance/efficiency cores, I'm a bit worried that just denylisting Apple
> Silicon will end up allowing other such systems.
That makes sense. As described earlier, I'm mainly including that
patch to show that we can expand this to ARM (i.e. that our general
design allows for it), but I'd say it's probably best considered
separately, in a new thread, for commit in a future release.
Attached v16 with your requested changes applied, as well as:
- Split out the CPU detection logic into a separate commit (as
suggested by Andres in a separate email)
- Moved a documentation snippet referencing pg_test_timing to the
later commit that shows TSC details in pg_test_timing (if for some
reason we didn't commit that pg_test_timing change, that docs snippet
wouldn't make sense)
- Removed the socket counting logic for determining whether to use
TSC by default; it's now only gated on having a frequency, and on
PG_RDTSCP + PG_TSC_INVARIANT + PG_TSC_ADJUST. After some pondering and
testing on the one 8-socket machine I could get my hands on (a
"u-6tb1.56xlarge" instance in AWS), which had a consistent TSC
frequency across its 224 cores, I suspect this was more of an edge
case that the Linux kernel folks were concerned about. Since we have
the GUC to turn off use of the TSC, it seems better to reduce code
complexity here for now.
Thanks,
Lukas
--
Lukas Fittl
| Attachment | Content-Type | Size |
|---|---|---|
| v16-0002-Allow-retrieving-x86-TSC-frequency-flags-from-CP.patch | application/octet-stream | 6.5 KB |
| v16-0003-instrumentation-Use-Time-Stamp-Counter-TSC-on-x8.patch | application/octet-stream | 30.8 KB |
| v16-0004-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and.patch | application/octet-stream | 7.3 KB |
| v16-0001-instrumentation-Streamline-ticks-to-nanosecond-c.patch | application/octet-stream | 16.4 KB |
| v16-0005-instrumentation-ARM-support-for-fast-time-measur.patch | application/octet-stream | 8.4 KB |