Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>
Cc: Lukas Fittl <lukas(at)fittl(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2026-02-23 22:27:59
Message-ID: r5snevsnkyoifjqldu6gcssbnrnezpplq4ofcmekjfvzzu32dc@5rn26itd4ubr
Lists: pgsql-hackers

Hi,

On 2026-02-23 16:24:57 +0100, David Geier wrote:
> The code wasn't compiling properly on Windows because __x86_64__ is not
> defined in Visual C++. I've changed the code to use
>
> #if defined(__x86_64__) || defined(_M_X64)

Independently of this patchset I wonder if it'd be worth introducing a
PG_ARCH_X64 or such, to avoid this kind of thing.
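Roughly something like this, say in c.h (PG_ARCH_X64 is just the name floated above, nothing settled):

```c
/* Hypothetical sketch: one canonical architecture macro instead of
 * repeating the compiler-specific test at every call site. */
#if defined(__x86_64__) || defined(_M_X64)
#define PG_ARCH_X64 1
#endif

/* Call sites would then become: */
#ifdef PG_ARCH_X64
/* x86-64 only code, e.g. the RDTSC/RDTSCP paths */
#endif
```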

> I've tested v8 of the patch (= v7 plus aforementioned changes) on
> Windows. I'm reporting the best of 3 runs.
>
> lotsarows test with parallelism disabled:
>
> master: 2781 ms
> v7: 2776 ms (timing_clock_source = 'system')
> v7: 2091 ms (timing_clock_source = 'tsc')

Nice.

> pg_test_timing:
>
> master: 27.04 ns
> v7: 16.59 ns (QueryPerformanceCounter)
> v7: 13.69 ns (RDTSCP)
> v7: 9.42 ns (RDTSC)

Very nice.

Unfortunately, on Linux, applying up to 0002 causes a small regression in
pg_test_timing.

With cpuidle disabled, performance governor, pinned to one core.

pg_test_timing, turboboost disabled:

412f78c66ee 27.70 ns
0002 28.48 ns

pg_test_timing, turboboost enabled:

412f78c66ee 20.41 ns
0002 21.04 ns

However, I tried, but failed, to get an actual EXPLAIN ANALYZE to show that
difference. All the differences I see are well below the run-to-run noise.

Which makes sense - the increase in overhead is probably only visible here
because it lengthens the dependency chain inside the loop, which wouldn't be
visible in a normal EXPLAIN (and of course, with more patches applied, a lot
more is gained).

> From 25b58d2890e65a95ce426a0b80fab41c1c99bd8f Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Sat, 31 Jan 2026 08:49:46 -0800
> Subject: [PATCH v8 1/4] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT
> separately
>
> Previously we would only check for the availability of __cpuidex if
> the related __get_cpuid_count was not available on a platform. But there
> are cases where we want to be able to call __cpuidex as the only viable
> option, specifically, when accessing a high leaf like VM Hypervisor
> information (0x40000000), which __get_cpuid_count does not allow.
>
> This will be used in a future commit to access Hypervisor information
> about the TSC frequency of x86 CPUs, where available.
>
> Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h
> for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available,
> adjust existing code to check for _MSC_VER when including intrin.h.
>
> Author: Lukas Fittl <lukas(at)fittl(dot)com>
> Reviewed-by:
> Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

> # Check for XSAVE intrinsics
> diff --git a/meson.build b/meson.build
> index ebfb85e93e5..312c919eaa4 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2080,7 +2080,8 @@ elif cc.links('''
> endif
>
>
> -# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
> +# Check for __get_cpuid_count() and __cpuidex() separately, since we sometimes
> +# need __cpuidex() even if __get_cpuid_count() is available.
> if cc.links('''
> #include <cpuid.h>
> int main(int arg, char **argv)
> @@ -2091,8 +2092,13 @@ if cc.links('''
> ''', name: '__get_cpuid_count',
> args: test_c_args)
> cdata.set('HAVE__GET_CPUID_COUNT', 1)
> -elif cc.links('''
> +endif
> +if cc.links('''
> + #ifdef _MSC_VER
> #include <intrin.h>
> + #else
> + #include <cpuid.h>
> + #endif
> int main(int arg, char **argv)
> {
> unsigned int exx[4] = {0, 0, 0, 0};

FWIW, this seems to trigger a warning locally:

/srv/dev/build/postgres/m-dev-assert/meson-private/tmpw34r2pnc/testfile.c: In function 'main':
/srv/dev/build/postgres/m-dev-assert/meson-private/tmpw34r2pnc/testfile.c:10:19: warning: pointer targets in passing argument 1 of '__cpuidex' differ in signe>
10 | __cpuidex(exx, 7, 0);
| ^~~
| |
| unsigned int *
In file included from /srv/dev/build/postgres/m-dev-assert/meson-private/tmpw34r2pnc/testfile.c:5:
/home/andres/build/gcc/master/install/lib/gcc/x86_64-pc-linux-gnu/16/include/cpuid.h:361:16: note: expected 'int *' but argument is of type 'unsigned int *'
361 | __cpuidex (int __cpuid_info[4], int __leaf, int __subleaf)
| ~~~~^~~~~~~~~~~~~~~
-----------
Checking if "__cpuidex" links: YES

> diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
> index f586476964f..7a75380b483 100644
> --- a/src/port/pg_crc32c_sse42_choose.c
> +++ b/src/port/pg_crc32c_sse42_choose.c
> @@ -20,11 +20,11 @@
>
> #include "c.h"
>
> -#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
> +#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
> #include <cpuid.h>
> #endif

Why would we want to include cpuid.h with msvc if one of the other variables
is defined?

> -#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
> +#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
> #include <intrin.h>
> #endif

And here, why would we want to include intrin.h if HAVE__CPUID is defined?

Seems like this should just be something like

#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX)
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif /* defined(_MSC_VER) */
#endif

> From 2392d95626599a1b5562f9216eb8c334db99c932 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Fri, 25 Jul 2025 17:57:20 -0700
> Subject: [PATCH v8 2/4] Timing: Streamline ticks to nanosecond conversion
> across platforms
>
> The timing infrastructure (INSTR_* macros) measures time elapsed using
> clock_gettime() on POSIX systems, which returns the time as nanoseconds,
> and QueryPerformanceCounter() on Windows, which is a specialized timing
> clock source that returns a tick counter that needs to be converted to
> nanoseconds using the result of QueryPerformanceFrequency().
>
> This conversion currently happens ad-hoc on Windows, e.g. when calling
> INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
> invocation, despite the frequency being stable after program start,
> incurring unnecessary overhead. It also causes a fractured implementation
> where macros are defined differently between platforms.
>
> To ease code readability, and prepare for a future change that intends
> to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
> a new pg_ticks_to_ns() function that gets called on all platforms.
>
> This function relies on a separately initialized ticks_per_ns_scaled
> value, that represents the conversion ratio. This value is initialized
> from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
> POSIX systems, which results in the ticks being treated as nanoseconds.
> Other architectures always directly return the original ticks.
>
> To support this, pg_initialize_timing() is introduced, and is now
> mandatory for both the backend and any frontend programs to call before
> utilizing INSTR_* macros.

I wonder if it's worth trying to transparently initialize in the overflow
codepath. Probably not, but worth explicitly considering.

> In passing modify pg_test_timing to reduce the per-loop overhead caused
> by repeated divisions in INSTR_TIME_GET_NANOSEC when the ticks variable
> has become very large. Instead diff first and then turn it into nanosecs.

I'd like to see this broken out into a separate change.

> diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
> index a5621251afc..9fd630a490a 100644

> @@ -182,9 +184,8 @@ test_timing(unsigned int duration)
> bits;
>
> prev = cur;
> - INSTR_TIME_SET_CURRENT(temp);
> - cur = INSTR_TIME_GET_NANOSEC(temp);
> - diff = cur - prev;
> + INSTR_TIME_SET_CURRENT(cur);
> + diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
>
> /* Did time go backwards? */
> if (unlikely(diff < 0))

FWIW, I don't think this needs a special INSTR_TIME macro, it could just use
INSTR_TIME_SUBTRACT() and INSTR_TIME_GET_NANOSEC().
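I.e. something like this (a sketch against the existing instr_time.h macros,
not compiled; 'temp' is reused as scratch space for the difference):

```c
prev = cur;
INSTR_TIME_SET_CURRENT(cur);
temp = cur;
INSTR_TIME_SUBTRACT(temp, prev);
diff = INSTR_TIME_GET_NANOSEC(temp);
```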

> diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
> index cb4e986092e..c8b233be16c 100644
> --- a/src/bin/pgbench/pgbench.c
> +++ b/src/bin/pgbench/pgbench.c
> @@ -7334,6 +7334,9 @@ main(int argc, char **argv)
> initRandomState(&state[i].cs_func_rs);
> }
>
> + /* initialize timing infrastructure (required for INSTR_* calls) */
> + pg_initialize_timing();
> +
> /* opening connection... */
> con = doConnect();
> if (con == NULL)

FWIW, I also verified that I am unable to measure overhead in pgbench due to
the more expensive conversion. Not surprised, but it did seem like a
possibility, because pgbench unfortunately always converts the gathered time
to microseconds, rather than computing a difference between two timestamps.

> +
> +/*
> + * Stores what the number of ticks needs to be multiplied with to end up
> + * with nanoseconds using integer math.
> + *
> + * On certain platforms (currently Windows) the ticks to nanoseconds conversion
> + * requires floating point math because:
> + *
> + * sec = ticks / frequency_hz
> + * ns = ticks / frequency_hz * 1,000,000,000
> + * ns = ticks * (1,000,000,000 / frequency_hz)
> + * ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> + *
> + * Here, 'ns' is usually a floating number. For example for a 2.5 GHz CPU
> + * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
> + *
> + * To be able to use integer math we work around the lack of precision. We
> + * first scale the integer up and after the multiplication by the number
> + * of ticks in INSTR_TIME_GET_NANOSEC() we divide again by the same value.
> + * We picked the scaler such that it provides enough precision and is a
> + * power-of-two which allows for shifting instead of doing an integer
> + * division. We utilize unsigned integers even though ticks are stored as a
> + * signed value because that encourages compilers to generate better assembly.

> + * On all other platforms we are using clock_gettime(), which uses nanoseconds
> + * as ticks. Hence, we set the multiplier to zero, which causes pg_ticks_to_ns
> + * to return the original value.
> + */
> +uint64 ticks_per_ns_scaled = 0;
> +uint64 max_ticks_no_overflow = 0;
> +
> +static void set_ticks_per_ns(void);
> +
> +void
> +pg_initialize_timing()
> +{
> + set_ticks_per_ns();
> +}
> +
> +#ifndef WIN32
> +
> +static void
> +set_ticks_per_ns()
> +{
> + ticks_per_ns_scaled = 0;
> + max_ticks_no_overflow = 0;
> +}
> +
> +#else /* WIN32 */
> +
> +/* GetTimerFrequency returns counts per second */
> +static inline double
> +GetTimerFrequency(void)
> +{
> + LARGE_INTEGER f;
> +
> + QueryPerformanceFrequency(&f);
> + return (double) f.QuadPart;
> +}
> +
> +static void
> +set_ticks_per_ns()
> +{
> + ticks_per_ns_scaled = INT64CONST(1000000000) * TICKS_TO_NS_PRECISION / GetTimerFrequency();

This should probably use NS_PER_S.

I wonder whether we should use an explicit shift here and in pg_ticks_to_ns(),
to avoid having to rely on the compiler to do so for us.
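For illustration, the two forms side by side (TICKS_TO_NS_PRECISION_SHIFT is
an assumed name - the patch only defines the power-of-two value itself). For
unsigned operands both always produce the same result; for signed negative
values they would differ (truncation vs. floor), which is exactly why the
compiler can only do the strength reduction itself when the math is unsigned:

```c
#include <stdint.h>

/* TICKS_TO_NS_PRECISION_SHIFT is a hypothetical companion constant;
 * the patch only defines the power-of-two TICKS_TO_NS_PRECISION. */
#define TICKS_TO_NS_PRECISION_SHIFT 14
#define TICKS_TO_NS_PRECISION (UINT64_C(1) << TICKS_TO_NS_PRECISION_SHIFT)

/* Division form: relies on the compiler to strength-reduce the
 * division by a power-of-two constant into a shift. */
static inline uint64_t
unscale_div(uint64_t scaled)
{
	return scaled / TICKS_TO_NS_PRECISION;
}

/* Explicit-shift form: no reliance on the optimizer. */
static inline uint64_t
unscale_shift(uint64_t scaled)
{
	return scaled >> TICKS_TO_NS_PRECISION_SHIFT;
}
```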

> +static inline int64
> +pg_ticks_to_ns(int64 ticks)
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> + int64 ns = 0;
> +
> + if (ticks_per_ns_scaled == 0)
> + return ticks;

There should be a comment explaining (or referencing another explanation of)
why this exists.

> + /*
> + * Would multiplication overflow? If so perform computation in two parts.
> + * Check overflow without actually overflowing via: a * b > max <=> a >
> + * max / b
> + */
> + if (unlikely(ticks > (int64) max_ticks_no_overflow))

The "via" comment seems a bit misplaced, given that the transformation is not
really utilized here (but at the point where max_ticks_no_overflow is
computed).

> + {
> + /*
> + * Compute how often the maximum number of ticks fits completely into
> + * the number of elapsed ticks and convert that number into
> + * nanoseconds. Then multiply by the count to arrive at the final
> + * value. In a 2nd step we adjust the number of elapsed ticks and
> + * convert the remaining ticks.
> + */
> + int64 count = ticks / max_ticks_no_overflow;
> + int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
> +
> + ns = max_ns * count;
> +
> + /*
> + * Subtract the ticks that we now already accounted for, so that they
> + * don't get counted twice.
> + */
> + ticks -= count * max_ticks_no_overflow;
> + Assert(ticks >= 0);

I think we could perhaps make the overflow case a good bit cheaper, by
avoiding any divisions with a non-constant factor (assuming I haven't blown
the logic below). Instead of doing a division we can "transform back" into
the non-scaled representation, I think?

ns = (ticks * ticks_per_ns_scaled) / TICKS_TO_NS_PRECISION

equals, assuming arbitrary precision

ns = (ticks / TICKS_TO_NS_PRECISION) * ticks_per_ns_scaled

and not assuming arbitrary precision:

count = ticks // TICKS_TO_NS_PRECISION
rem_ticks = ticks - (count * TICKS_TO_NS_PRECISION)
ns = count * ticks_per_ns_scaled + rem_ticks * ticks_per_ns_scaled // TICKS_TO_NS_PRECISION

None of which afaict would overflow?
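To sanity-check that, here's a self-contained sketch of the proposed rewrite
(the 2^14 scaler, the 2.5 GHz tick rate, and both function names are made up
for illustration - the real values come from pg_initialize_timing()). A
128-bit widened version of the naive formula serves as the reference; the
split is exact because count * TICKS_TO_NS_PRECISION * scale is divisible by
TICKS_TO_NS_PRECISION:

```c
#include <stdint.h>

/* Illustrative values: a 2^14 scaler and a 2.5 GHz tick rate, i.e.
 * ticks_per_ns_scaled = 1e9 * 2^14 / 2.5e9 = 6553 (truncated). */
#define TICKS_TO_NS_PRECISION (UINT64_C(1) << 14)
static const uint64_t ticks_per_ns_scaled = 6553;

/* Proposed conversion: split the ticks on the compile-time
 * power-of-two scaler, so no term can overflow 64 bits and no
 * division by a runtime value is needed. */
static int64_t
ticks_to_ns_split(int64_t ticks)
{
	uint64_t t = (uint64_t) ticks;
	uint64_t count = t / TICKS_TO_NS_PRECISION;
	uint64_t rem = t - count * TICKS_TO_NS_PRECISION;

	return (int64_t) (count * ticks_per_ns_scaled +
					  rem * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION);
}

/* Reference: the naive formula evaluated in 128 bits. */
static int64_t
ticks_to_ns_ref(int64_t ticks)
{
	unsigned __int128 wide = (unsigned __int128) (uint64_t) ticks * ticks_per_ns_scaled;

	return (int64_t) (uint64_t) (wide / TICKS_TO_NS_PRECISION);
}
```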

> --- a/src/common/instr_time.c
> +++ b/src/common/instr_time.c
> @@ -20,8 +20,8 @@
> * Stores what the number of ticks needs to be multiplied with to end up
> * with nanoseconds using integer math.
> *
> - * On certain platforms (currently Windows) the ticks to nanoseconds conversion
> - * requires floating point math because:
> + * In certain cases (TSC on x86-64, and QueryPerformanceCounter on Windows)
> + * the ticks to nanoseconds conversion requires floating point math because:
> *
> * sec = ticks / frequency_hz
> * ns = ticks / frequency_hz * 1,000,000,000
> @@ -39,7 +39,7 @@
> * division. We utilize unsigned integers even though ticks are stored as a
> * signed value because that encourages compilers to generate better assembly.
> *
> - * On all other platforms we are using clock_gettime(), which uses nanoseconds
> + * In all other cases we are using clock_gettime(), which uses nanoseconds
> * as ticks. Hence, we set the multiplier to zero, which causes pg_ticks_to_ns
> * to return the original value.
> */
> @@ -48,16 +48,57 @@ uint64 max_ticks_no_overflow = 0;
>
> static void set_ticks_per_ns(void);
>
> +int timing_clock_source = TIMING_CLOCK_SOURCE_AUTO;
> +
> +#if defined(__x86_64__) || defined(_M_X64)
> +/* Indicates if TSC instructions (RDTSC and RDTSCP) are usable. */
> +extern bool has_usable_tsc;
> +
> +static void tsc_initialize(void);
> +static bool tsc_use_by_default(void);
> +static void set_ticks_per_ns_for_tsc(void);
> +static bool set_tsc_frequency_khz(void);
> +static bool is_rdtscp_available(void);
> +#endif
> +
> void
> pg_initialize_timing()
> {
> +#if defined(__x86_64__) || defined(_M_X64)
> + tsc_initialize();
> +#endif
> +
> + set_ticks_per_ns();
> +}
> +
> +bool
> +pg_set_timing_clock_source(TimingClockSourceType source)
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> + switch (source)
> + {
> + case TIMING_CLOCK_SOURCE_AUTO:
> + use_tsc = has_usable_tsc && tsc_use_by_default();
> + break;
> + case TIMING_CLOCK_SOURCE_SYSTEM:
> + use_tsc = false;
> + break;
> + case TIMING_CLOCK_SOURCE_TSC:
> + if (!has_usable_tsc) /* Tell caller TSC is not usable */
> + return false;
> + use_tsc = true;
> + break;
> + }
> +#endif
> set_ticks_per_ns();
> + timing_clock_source = source;
> + return true;
> }

Perhaps this should ensure that pg_initialize_timing() has already been called?

> +bool
> +check_timing_clock_source(int *newval, void **extra, GucSource source)
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> + pg_initialize_timing();
> +
> + if (*newval == TIMING_CLOCK_SOURCE_TSC && !has_usable_tsc)
> + {
> + GUC_check_errdetail("TSC is not supported as fast clock source");
> + return false;

The GUC name doesn't refer to "fast", so probably this shouldn't either?

> +const char *
> +show_timing_clock_source()
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> + TimingClockSourceType effective_source = pg_current_timing_clock_source();
> +
> + switch (timing_clock_source)
> + {
> + case TIMING_CLOCK_SOURCE_AUTO:
> + if (effective_source == TIMING_CLOCK_SOURCE_TSC)
> + return "auto (tsc)";
> + else
> + return "auto (system)";
> + case TIMING_CLOCK_SOURCE_SYSTEM:
> + return "system";
> + case TIMING_CLOCK_SOURCE_TSC:
> + return "tsc";
> + }
> +#else
> + switch (timing_clock_source)
> + {
> + case TIMING_CLOCK_SOURCE_AUTO:
> + return "auto (system)";
> + case TIMING_CLOCK_SOURCE_SYSTEM:
> + return "system";
> + }
> +#endif

Seems like it'd be nicer if we had one switch with the ifdef-ery inside the
TIMING_CLOCK_SOURCE_AUTO case? If we add support for TSC-based clock sources
on ARM as well, this would get a bit unmanageable.
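Roughly like this (a self-contained sketch: the enum and names mirror the
patch, but the stub for pg_current_timing_clock_source() is made up so the
example compiles standalone):

```c
#include <string.h>

typedef enum
{
	TIMING_CLOCK_SOURCE_AUTO,
	TIMING_CLOCK_SOURCE_SYSTEM,
	TIMING_CLOCK_SOURCE_TSC,
} TimingClockSourceType;

/* stub for illustration only */
static TimingClockSourceType
pg_current_timing_clock_source(void)
{
	return TIMING_CLOCK_SOURCE_SYSTEM;
}

static const char *
show_timing_clock_source(TimingClockSourceType timing_clock_source)
{
	switch (timing_clock_source)
	{
		case TIMING_CLOCK_SOURCE_AUTO:
			/* architecture-specific part confined to this one case */
#if defined(__x86_64__) || defined(_M_X64)
			if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
				return "auto (tsc)";
#endif
			return "auto (system)";
		case TIMING_CLOCK_SOURCE_SYSTEM:
			return "system";
		case TIMING_CLOCK_SOURCE_TSC:
			return "tsc";
	}
	return "";					/* keep compilers quiet */
}
```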

> +static uint32 tsc_frequency_khz = 0;
> +
> +/*
> + * Decide whether we use the RDTSC/RDTSCP instructions at runtime, for Linux/x86-64,
> + * instead of incurring the overhead of a full clock_gettime() call.
> + *
> + * This can't be reliably determined at compile time, since the
> + * availability of an "invariant" TSC (that is not affected by CPU
> + * frequency changes) is dependent on the CPU architecture. Additionally,
> + * there are cases where TSC availability is impacted by virtualization,
> + * where a simple cpuid feature check would not be enough.
> + */
> +static void
> +tsc_initialize(void)
> +{
> + /*
> + * Compute baseline CPU performance, which determines the speed at which the TSC
> + * advances.
> + */
> + if (!set_tsc_frequency_khz())
> + return;
> +
> + has_usable_tsc = is_rdtscp_available();
> +}
> +
> +/*
> + * Decides whether to use TSC clock source if the user did not specify it
> + * one way or the other, and it is available (checked separately).
> + *
> + * Currently only enabled by default on Linux, since Linux already does a
> + * significant amount of work to determine whether TSC is a viable clock
> + * source.
> + */
> +static bool
> +tsc_use_by_default()

Postgres style is still funcname(void).

> +{
> +#if defined(__linux__)
> + FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
> + char buf[128];
> +
> + if (!fp)
> + return false;
> +
> + if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
> + {
> + fclose(fp);
> + return true;
> + }
> +
> + fclose(fp);
> +#endif
> +
> + return false;
> +}

I think this will often disable tsc on VMs, due to linux defaulting to
kvm-clock in KVM VMs.

Do we care about that?

If the tsc is not actually viable, is it still listed in
/sys/devices/system/clocksource/clocksource0/available_clocksource
?

> +
> +#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
> +#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
> +
> +static bool
> +set_tsc_frequency_khz()
> +{
> + uint32 r[4] = {0, 0, 0, 0};
> +
> +#if defined(HAVE__GET_CPUID)
> + __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
> +#elif defined(HAVE__CPUID)
> + __cpuid(r, 0x15);
> +#else
> +#error cpuid instruction not available
> +#endif
> +
> + if (r[2] > 0)
> + {
> + if (r[0] == 0 || r[1] == 0)
> + return false;
> +
> + tsc_frequency_khz = r[2] / 1000 * r[1] / r[0];
> + return true;
> + }

I think there should be some explanation of what this is testing, perhaps
including a reference to the relevant documents.

> + /* Some CPUs only report frequency in 16H */

Ditto.

> +#if defined(HAVE__GET_CPUID)
> + __get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
> +#elif defined(HAVE__CPUID)
> + __cpuid(r, 0x16);
> +#else
> +#error cpuid instruction not available
> +#endif

Perhaps we could package the __get_cpuid / __cpuid thing in a wrapper, instead
of repeating the ifdefery three times?
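I.e. something like this (a sketch, not compiled; pg_cpuid_leaf is a made-up
name, HAVE__GET_CPUID / HAVE__CPUID come from configure):

```c
static void
pg_cpuid_leaf(uint32 leaf, uint32 *r)
{
#if defined(HAVE__GET_CPUID)
	__get_cpuid(leaf, &r[0], &r[1], &r[2], &r[3]);
#elif defined(HAVE__CPUID)
	__cpuid(r, leaf);
#else
#error cpuid instruction not available
#endif
}
```

set_tsc_frequency_khz() would then just call pg_cpuid_leaf(0x15, r) and
pg_cpuid_leaf(0x16, r).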

> index 985b6b5af88..e7191c5d6cd 100644
> --- a/src/include/portability/instr_time.h
> +++ b/src/include/portability/instr_time.h
> @@ -4,9 +4,10 @@
> * portable high-precision interval timing
> *
> * This file provides an abstraction layer to hide portability issues in
> - * interval timing. On Unix we use clock_gettime(), and on Windows we use
> - * QueryPerformanceCounter(). These macros also give some breathing room to
> - * use other high-precision-timing APIs.
> + * interval timing. On x86 we use the RDTSC/RDTSCP instruction directly in
> + * certain cases, or alternatively clock_gettime() on Unix-like systems and
> + * QueryPerformanceCounter() on Windows. These macros also give some breathing
> + * room to use other high-precision-timing APIs.
> *
> * The basic data type is instr_time, which all callers should treat as an
> * opaque typedef. instr_time can store either an absolute time (of
> @@ -17,10 +18,11 @@
> *
> * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
> *
> - * INSTR_TIME_SET_CURRENT(t) set t to current time
> + * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
> + * for instructions in out-of-order window
> *
> - * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
> - * evaluates to whether t changed
> + * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
> + * instructions in OOO to retire
> *
> * INSTR_TIME_ADD(x, y) x += y
> *

I'd probably remove INSTR_TIME_SET_CURRENT_LAZY in a prep commit.

> @@ -93,13 +95,54 @@ typedef struct instr_time
> extern PGDLLIMPORT uint64 ticks_per_ns_scaled;
> extern PGDLLIMPORT uint64 max_ticks_no_overflow;
>
> +#if defined(__x86_64__) || defined(_M_X64)
> +#include <immintrin.h>

Why do we need to include immintrin.h in instr_time.h? Including immintrin.h
makes compilation a lot slower:

$ echo '#include <immintrin.h>'|gcc -ftime-report -xc -o /dev/null -c -

Time variable wall GGC
phase setup : 0.00 ( 1%) 1905k ( 6%)
phase parsing : 0.50 ( 99%) 30M ( 94%)
preprocessing : 0.09 ( 18%) 5375k ( 16%)
lexical analysis : 0.02 ( 5%) 0 ( 0%)
parser (global) : 0.31 ( 61%) 12M ( 39%)
parser inl. func. body : 0.08 ( 15%) 12M ( 39%)
TOTAL : 0.51 32M

> int
> @@ -46,10 +46,47 @@ main(int argc, char *argv[])
> /* initialize timing infrastructure (required for INSTR_* calls) */
> pg_initialize_timing();
>
> - loop_count = test_timing(test_duration);
> -
> + /*
> + * First, test default (non-fast) timing code. A clock source for that is
> + * always available. Hence, we can unconditionally output the result.
> + */
> + loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_SYSTEM, false);
> output(loop_count);
>
> +#if defined(__x86_64__) || defined(_M_X64)

I don't love that pg_test_timing.c now has architecture-specific checks.
Could we abstract this a bit more?

> + /*
> + * If on a supported architecture, test the RDTSC clock source. This clock
> + * source is not always available. In that case the loop count will be 0
> + * and we don't print.
> + *
> + * We first emit RDTSCP timings, which is slower, and gets used for higher
> + * precision measurements when the TSC clock source is enabled. We emit
> + * RDTSC second, which is used for faster timing measurements with lower
> + * precision.
> + */
> + printf("\n");
> + loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, false);
> + if (loop_count > 0)
> + {
> + output(loop_count);
> + printf("\n");
> +
> + /* Now, emit fast timing measurements */
> + loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
> + output(loop_count);
> + printf("\n");
> +
> + pg_set_timing_clock_source(TIMING_CLOCK_SOURCE_AUTO);
> + if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
> + printf(_("TSC clock source will be used by default, unless timing_clock_source is set to 'system'.\n"));
> + else
> + printf(_("TSC clock source will not be used by default, unless timing_clock_source is set to 'tsc'.\n"));
> + }
> + else
> + printf(_("TSC clock source is not usable. Likely unable to determine TSC frequency. are you running in an unsupported virtualized environment?.\n"));
> +#endif
> +

A bit weird that most of the output is handled in output(), but some of it is
now handled directly in main(), and some in test_timing().

Greetings,

Andres Freund
