Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Lukas Fittl <lukas(at)fittl(dot)com>
Cc: John Naylor <johncnaylorls(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2026-04-02 00:54:57
Message-ID: 33xt5wzr7g4kstfcugjm33rnm6uwtolay33ku5ctnfoyb235jm@7m42xkovofxr
Lists: pgsql-hackers

Hi,

Lukas, could you add some preliminary Reviewed-by tags to the
not-yet-committed commits?

On 2026-03-25 18:17:44 -0700, Lukas Fittl wrote:
> From 7cc7e230e05947892b2aa479eba45144beede48e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Sat, 31 Jan 2026 08:49:46 -0800
> Subject: [PATCH v14 1/6] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT
> separately

John, do you want to take this one, or should I take it?

> Subject: [PATCH v14 2/6] pg_test_timing: Reduce per-loop overhead

Pushed, after simplifying the following

> + INSTR_TIME_ADD_NANOSEC(end_time, duration > 0 ? duration * NS_PER_S : 0);

by removing that ?:, since duration is unsigned and the caller checks for 0
and negative values. And even if a negative value or 0 were to be passed, it'd
still be harmless.
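For reference, a tiny standalone illustration of why the guard was redundant (the helper name and NS_PER_S constant here are only for the sketch, not the tree's actual definitions):

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_S UINT64_C(1000000000)

/*
 * duration is unsigned, so "duration > 0 ? duration * NS_PER_S : 0" is
 * equivalent to the plain multiplication: for 0 both branches yield 0,
 * and a negative value cannot occur in an unsigned type.
 */
static uint64_t
duration_to_ns(uint64_t duration)
{
	return duration * NS_PER_S;
}
```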

> From 8fbde9831fd89615a41af56917aed9faf619dbbb Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Fri, 25 Jul 2025 17:57:20 -0700
> Subject: [PATCH v14 3/6] instrumentation: Streamline ticks to nanosecond
> conversion across platforms
>
> The timing infrastructure (INSTR_* macros) measures time elapsed using
> clock_gettime() on POSIX systems, which returns the time as nanoseconds,
> and QueryPerformanceCounter() on Windows, which is a specialized timing
> clock source that returns a tick counter that needs to be converted to
> nanoseconds using the result of QueryPerformanceFrequency().
>
> This conversion currently happens ad-hoc on Windows, e.g. when calling
> INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
> invocation, despite the frequency being stable after program start,
> incurring unnecessary overhead. It also causes a fractured implementation
> where macros are defined differently between platforms.
>
> To ease code readability, and prepare for a future change that intends
> to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
> a new pg_ticks_to_ns() function that gets called on all platforms.
>
> This function relies on a separately initialized ticks_per_ns_scaled
> value, that represents the conversion ratio. This value is initialized
> from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
> POSIX systems, which results in the ticks being treated as nanoseconds.
> Other architectures always directly return the original ticks.

> To support this, pg_initialize_timing() is introduced, and is now
> mandatory for both the backend and any frontend programs to call before
> utilizing INSTR_* macros.

Are there assertions detecting missing initializations? Ah, yes, there are.

> +/*
> + * Stores what the number of ticks needs to be multiplied with to end up
> + * with nanoseconds using integer math.
> + *
> + * On certain platforms (currently Windows) the ticks to nanoseconds conversion
> + * requires floating point math because:
> + *
> + * sec = ticks / frequency_hz
> + * ns = ticks / frequency_hz * 1,000,000,000
> + * ns = ticks * (1,000,000,000 / frequency_hz)
> + * ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> + *
> + * Here, 'ns' is usually a floating-point number. For example, for a 2.5 GHz CPU
> + * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
> + *
> + * To be able to use integer math we work around the lack of precision. We
> + * first scale the integer up (left shift by TICKS_TO_NS_SHIFT) and after the
> + * multiplication by the number of ticks in pg_ticks_to_ns() we shift right by
> + * the same amount. We utilize unsigned integers even though ticks are stored
> + * as a signed value to encourage compilers to generate better assembly.
> + *
> + * We remember the maximum number of ticks that can be multiplied by the scale
> + * factor without overflowing so we can check via a * b > max <=> a > max / b.
> + *
> + * On all other platforms we are using clock_gettime(), which uses nanoseconds
> + * as ticks. Hence, we set the multiplier to zero, which causes pg_ticks_to_ns
> + * to return the original value.
> + */
> +uint64 ticks_per_ns_scaled = 0;
> +uint64 max_ticks_no_overflow = 0;
> +bool timing_initialized = false;

Think it may be worth adding an example for a realistic max_ticks_no_overflow
to show that we almost always are going to be able to avoid the overflow case.
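For instance (a rough sketch — the 16-bit shift and the 10 MHz QueryPerformanceCounter frequency below are assumed example values, not the patch's actual constants): at 10 MHz the scale factor works out to 6,553,600, and the overflow threshold to about 1.4e12 ticks, i.e. an interval of roughly 39 hours before the slow path is ever taken.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants, mirroring the patch's scheme */
#define TICKS_TO_NS_SHIFT 16
#define NS_PER_S UINT64_C(1000000000)

/* scale factor: (10^9 << shift) / frequency */
static uint64_t
ticks_per_ns_scaled_for(uint64_t frequency_hz)
{
	return (NS_PER_S << TICKS_TO_NS_SHIFT) / frequency_hz;
}

/* largest tick count whose scaled product still fits in int64 */
static uint64_t
max_ticks_no_overflow_for(uint64_t frequency_hz)
{
	return (uint64_t) INT64_MAX / ticks_per_ns_scaled_for(frequency_hz);
}
```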

> +static void set_ticks_per_ns(void);
> +
> +void
> +pg_initialize_timing(void)
> +{
> + if (timing_initialized)
> + return;
> +
> + set_ticks_per_ns();
> + timing_initialized = true;
> +}
> +
> +#ifndef WIN32
> +
> +static void
> +set_ticks_per_ns(void)
> +{
> + ticks_per_ns_scaled = 0;
> + max_ticks_no_overflow = 0;
> +}
> +
> +#else /* WIN32 */
> +
> +/* GetTimerFrequency returns counts per second */
> +static inline double
> +GetTimerFrequency(void)
> +{
> + LARGE_INTEGER f;
> +
> + QueryPerformanceFrequency(&f);
> + return (double) f.QuadPart;
> +}
> +
> +static void
> +set_ticks_per_ns(void)
> +{
> + ticks_per_ns_scaled = (NS_PER_S << TICKS_TO_NS_SHIFT) / GetTimerFrequency();
> + max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
> +}
> +
> +#endif /* WIN32 */

Seems a tiny bit odd to introduce these here under one name, only to then
evolve the naming pattern in the subsequent patches.

> +static inline int64
> +pg_ticks_to_ns(int64 ticks)
> {
> - LARGE_INTEGER f;
> +#if PG_INSTR_TICKS_TO_NS
> + int64 ns = 0;
> +
> + Assert(timing_initialized);
> +
> + /*
> + * Avoid doing work if we don't use scaled ticks, e.g. system clock on
> + * Unix
> + */
> + if (ticks_per_ns_scaled == 0)
> + return ticks;
> +
> + /*
> + * Would multiplication overflow? If so perform computation in two parts.
> + */
> + if (unlikely(ticks > (int64) max_ticks_no_overflow))
> + {
> + /*
> + * To avoid overflow, first scale total ticks down by the fixed
> + * factor, and *afterwards* multiply them by the frequency-based scale
> + * factor.
> + *
> + * The remaining ticks can follow the regular formula, since they
> + * won't overflow.
> + */
> + int64 count = ticks >> TICKS_TO_NS_SHIFT;
> +
> + ns = count * ticks_per_ns_scaled;
> + ticks -= (count << TICKS_TO_NS_SHIFT);
> + }
> +
> + ns += (ticks * ticks_per_ns_scaled) >> TICKS_TO_NS_SHIFT;
> +
> + return ns;
> +#else
> + Assert(timing_initialized);
>
> - QueryPerformanceFrequency(&f);
> - return (double) f.QuadPart;
> + return ticks;
> +#endif /* PG_INSTR_TICKS_TO_NS */

Kinda wondering what the best way would be to verify that the overflow case
actually works. What about adding a regress.c test that checks that we get
the right results for something like

INSTR_TIME_SET_ZERO(t);
INSTR_TIME_ADD_NANOSEC(t, some_number);
EXPECT_EQ_U64(INSTR_TIME_GET_NANOSEC(t), some_number);

for a few different some_numbers?
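As a standalone sanity check of that idea outside the tree — a reimplementation of the overflow-splitting conversion, with assumed constants (16-bit shift, 10 MHz tick source, so 100 ns per tick), not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>

#define SHIFT      16
#define DEMO_SCALE ((UINT64_C(1000000000) << SHIFT) / UINT64_C(10000000))
#define DEMO_MAX   ((uint64_t) INT64_MAX / DEMO_SCALE)

static int64_t
demo_ticks_to_ns(int64_t ticks)
{
	int64_t		ns = 0;

	/* split the computation when ticks * DEMO_SCALE would overflow int64 */
	if ((uint64_t) ticks > DEMO_MAX)
	{
		/* the two 2^SHIFT factors cancel, so count * scale is already ns */
		int64_t		count = ticks >> SHIFT;

		ns = count * (int64_t) DEMO_SCALE;
		ticks -= count << SHIFT;
	}
	return ns + (int64_t) (((uint64_t) ticks * DEMO_SCALE) >> SHIFT);
}
```

The interesting assertion is the one with a tick count above DEMO_MAX, which exercises the splitting branch and should still come out exact.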

> Subject: [PATCH v14 4/6] instrumentation: Use Time-Stamp Counter (TSC) on
> x86-64 for faster measurements

> +bool
> +check_timing_clock_source(int *newval, void **extra, GucSource source)
> +{
> + /*
> + * Ensure timing is initialized. On Windows (EXEC_BACKEND), GUC hooks can
> + * be called during InitializeGUCOptions() before InitProcessGlobals() has
> + * had a chance to run pg_initialize_timing().
> + */
> + pg_initialize_timing();

So doesn't that mean the effort to synchronize the tsc freq on EXEC_BACKEND is futile?

> diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
> index 1d9ee4fb588..ee0e3a3b0ab 100644
> --- a/src/bin/pg_test_timing/pg_test_timing.c
> +++ b/src/bin/pg_test_timing/pg_test_timing.c
> @@ -43,7 +43,9 @@ main(int argc, char *argv[])
>
> handle_args(argc, argv);
>
> - /* initialize timing infrastructure (required for INSTR_* calls) */
> + /*
> + * Initialize timing infrastructure (required for INSTR_* calls)
> + */
> pg_initialize_timing();
>
> loop_count = test_timing(test_duration);

This seems like a superfluous change?

> +bool
> +pg_set_timing_clock_source(TimingClockSourceType source)
> +{
> + Assert(timing_initialized);
> +
> +#if PG_INSTR_TSC_CLOCK
> + pg_initialize_timing_tsc();
> +#endif
> +
> +#if PG_INSTR_TSC_CLOCK
> + switch (source)
> + {
> + case TIMING_CLOCK_SOURCE_AUTO:
> + use_tsc = (tsc_frequency_khz > 0) && tsc_use_by_default();

A global named use_tsc seems a tad too short a name for my personal taste. Too
likely to also be used by something else.

> +int32 tsc_frequency_khz = -1;
> +
> +static uint32 tsc_calibrate(void);
> +
> +/*
> + * Detect the TSC frequency and whether RDTSCP is available on x86-64.
> + *
> + * This can't be reliably determined at compile time, since the
> + * availability of an "invariant" TSC (that is not affected by CPU
> + * frequency changes) is dependent on the CPU architecture. Additionally,
> + * there are cases where TSC availability is impacted by virtualization,
> + * where a simple cpuid feature check would not be enough.
> + */
> +static void
> +tsc_detect_frequency(void)
> +{
> + tsc_frequency_khz = 0;

So we're using -1 for uninitialized and 0 for unknown? If so we should
document that.

> + /* We require RDTSCP support, bail if not available */
> + if (!x86_feature_available(PG_RDTSCP))
> + return;
> +
> + /* Determine speed at which the TSC advances */
> + tsc_frequency_khz = x86_tsc_frequency_khz();
> + if (tsc_frequency_khz > 0)
> + return;
> +
> + /*
> + * CPUID did not give us the TSC frequency. If TSC is invariant and RDTSCP
> + * is available, we can measure the frequency by comparing TSC ticks
> + * against walltime using a short calibration loop.
> + */
> + if (x86_feature_available(PG_TSC_INVARIANT))
> + tsc_frequency_khz = tsc_calibrate();
> +}

Why do we not need to always check if the TSC is invariant? Seems like a hard
requirement to me?

> +/*
> + * Calibrate the TSC frequency by comparing TSC ticks against walltime.
> + *
> + * Takes initial TSC and system clock snapshots, then loops, recomputing the
> + * frequency each TSC_CALIBRATION_SKIPS iterations from cumulative TSC
> + * ticks divided by elapsed time.
> + *
> + * Once the frequency estimate stabilizes (consecutive iterations agree), we
> + * consider it converged and the frequency in KHz is returned. If either too
> + * many iterations or a time limit passes without convergence, 0 is returned.
> + */
> +#define TSC_CALIBRATION_MAX_NS (50 * NS_PER_MS)
> +#define TSC_CALIBRATION_ITERATIONS 1000000
> +#define TSC_CALIBRATION_SKIPS 100
> +#define TSC_CALIBRATION_STABLE_CYCLES 10
> +

> +static uint32
> +tsc_calibrate(void)
> +{
> + instr_time initial_wall;
> + int64 initial_tsc;
> + double freq_khz = 0;
> + double prev_freq_khz = 0;
> + int stable_count = 0;
> + int64 prev_tsc;
> + uint32 unused;
> +
> + /* Ensure INSTR_* time below work on system time */
> + set_ticks_per_ns_system();
> +
> + INSTR_TIME_SET_CURRENT(initial_wall);
> +
> +#ifdef _MSC_VER
> + initial_tsc = __rdtscp(&unused);
> +#else
> + initial_tsc = __builtin_ia32_rdtscp(&unused);
> +#endif
> + prev_tsc = initial_tsc;
> +
> + for (int i = 0; i < TSC_CALIBRATION_ITERATIONS; i++)
> + {
> + instr_time now_wall;
> + int64 now_tsc;
> + int64 elapsed_ns;
> + int64 elapsed_ticks;
> +
> + INSTR_TIME_SET_CURRENT(now_wall);
> +
> +#ifdef _MSC_VER
> + now_tsc = __rdtscp(&unused);
> +#else
> + now_tsc = __builtin_ia32_rdtscp(&unused);
> +#endif

I'd put this in a helper, given that we already need it twice, and we might
later need to extend it for other architectures.

> + /*
> + * Skip if this is not the Nth cycle where we measure, if TSC hasn't
> + * advanced, or we walked backwards for some reason.
> + */
> + if (i % TSC_CALIBRATION_SKIPS != 0 || now_tsc == prev_tsc || elapsed_ns <= 0 || elapsed_ticks <= 0)
> + continue;

What's the goal of TSC_CALIBRATION_SKIPS?

> + freq_khz = ((double) elapsed_ticks / elapsed_ns) * 1000 * 1000;
> +
> + /*
> + * Once freq_khz / prev_freq_khz is small, check if it stays that way.
> + * If it does for long enough, we've got a winner frequency.
> + */
> + if (prev_freq_khz != 0 && fabs(1 - freq_khz / prev_freq_khz) < 0.0001)
> + {
> + stable_count++;
> + if (stable_count >= TSC_CALIBRATION_STABLE_CYCLES)
> + return (uint32) freq_khz;
> + }
> + else
> + stable_count = 0;
> +
> + prev_tsc = now_tsc;
> + prev_freq_khz = freq_khz;
> + }
> +
> + /* did not converge */
> + return 0;
> +}

I think it may be worth teaching pg_test_timing to display the tsc frequency
and show the frequency determined by calibration even if we got it via
cpuid. Otherwise it'll be hard to debug this.

> +/*
> + * Initialize the TSC clock source by determining its usability and frequency.
> + *
> + * This can be called multiple times, as tsc_frequency_khz will be set to 0
> + * if a prior call determined the TSC is not usable. On EXEC_BACKEND (Windows),
> + * the TSC frequency may also be set by restore_backend_variables.
> + */
> +void
> +pg_initialize_timing_tsc(void)
> +{
> + if (tsc_frequency_khz < 0)
> + tsc_detect_frequency();
> +}

Maybe we should add a debug log with the frequency in the < 0 case?

> +uint32
> +x86_tsc_frequency_khz(void)
> +{
> + unsigned int r[4] = {0};
> +
> + if (!x86_feature_available(PG_TSC_INVARIANT))
> + return 0;
> +
> + if (x86_feature_available(PG_HYPERVISOR))
> + return x86_hypervisor_tsc_frequency_khz();
> +
> + /*
> + * On modern Intel CPUs, the TSC is implemented by invariant timekeeping
> + * hardware, also called "Always Running Timer", or ART. The ART stays
> + * consistent even if the CPU changes frequency due to changing power
> + * levels.
> + *
> + * As documented in "Determining the Processor Base Frequency" in the
> + * "Intel® 64 and IA-32 Architectures Software Developer's Manual",
> + * February 2026 Edition, we can get the TSC frequency as follows:
> + *
> + * Nominal TSC frequency = ( CPUID.15H:ECX[31:0] * CPUID.15H:EBX[31:0] ) /
> + * CPUID.15H:EAX[31:0]
> + *
> + * With CPUID.15H:ECX representing the nominal core crystal clock
> + * frequency, and EAX/EBX representing values used to translate the TSC
> + * value to that frequency, see "Chapter 20.17 "Time-Stamp Counter" of
> + * that manual.
> + *
> + * Older Intel CPUs, and other vendors do not set CPUID.15H:ECX, and as
> + * such we fall back to alternate approaches.
> + */
> + pg_cpuid(0x15, r);
> + if (r[ECX] > 0)
> + {
> + /*
> + * EBX not being set indicates invariant TSC is not available. Require
> + * EAX being non-zero too, to avoid a theoretical divide by zero.
> + */
> + if (r[EAX] == 0 || r[EBX] == 0)
> + return 0;
> +
> + return r[ECX] / 1000 * r[EBX] / r[EAX];
> + }
> +
> + /*
> + * When CPUID.15H is not available/incomplete, but we have verified an
> + * invariant TSC is used, we can instead get the processor base frequency
> + * in MHz from CPUID.16H:EAX, the "Processor Frequency Information Leaf".
> + */

s/can instead/can possibly instead/

As it's also optional...
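A quick standalone check of the 15H arithmetic quoted above (the leaf values in the assertion are purely illustrative — a 24 MHz crystal with an 84/2 ratio; note that dividing ECX by 1000 first, as the patch does, also keeps the intermediate within 32 bits):

```c
#include <assert.h>
#include <stdint.h>

/* Nominal TSC kHz = (ECX / 1000) * EBX / EAX, as in the quoted patch */
static uint32_t
tsc_khz_from_leaf15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	/* zero EAX/EBX means the leaf is incomplete; avoid dividing by zero */
	if (eax == 0 || ebx == 0 || ecx == 0)
		return 0;
	return ecx / 1000 * ebx / eax;
}
```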

> From 44292c17075264c659e03d784a6c7619e4c5bd3e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Tue, 10 Mar 2026 01:38:14 -0700
> Subject: [PATCH v14 6/6] instrumentation: ARM support for fast time
> measurements
>
> Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
> use of the cntvct_el0 instruction on ARM systems to access the generic
> timer that provides a synchronized ticks value across CPUs.
>
> Note this adds an exception for Apple Silicon CPUs, due to the observed
> fact that M3 and newer has different timer frequencies for the Efficiency
> and the Performance cores, and we can't be sure where we get scheduled.

There are other such SOCs with different clock speeds for
performance/efficiency cores, I'm a bit worried that just denylisting Apple
Silicon will end up allowing other such systems.

Greetings,

Andres Freund
