Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Lukas Fittl <lukas(at)fittl(dot)com>
Cc: John Naylor <johncnaylorls(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2026-04-07 00:40:00
Message-ID: zwv2ggywiz23vghehofkvsrunlmrzc2zbrohd6i4j6a53meb4l@3vl36n4tbxvp
Lists: pgsql-hackers

Hi,

On 2026-04-06 04:25:36 -0700, Lukas Fittl wrote:
> On Sun, Apr 5, 2026 at 10:15 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > - tsc_use_by_default() may be documenting things that aren't the case anymore
> >
>
> I don't think there is a correctness issue, unless you mean the fact
> that we're not doing the < 8 socket check anymore that is one of the
> things mentioned on the LKML posts referenced. I think the LKML post
> references are still helpful as evidence why we trust that Intel has
> reliable TSC handling.

Well:
"Mirrors the Linux kernel's clocksource watchdog disable logic"
doesn't seem quite right, given that in that place we are just checking
TSC_ADJUST and we don't have the < 8 socket check.

I'd probably say something like 'inspired by ... ' and mention that the rest
of the check is in tsc_detect_frequency().

I wonder if the cpuid tests should be a bit further abstracted into
pg_cpu_x86.c.

E.g. instead of tsc_detect_frequency() checking for PG_RDTSCP,
PG_TSC_INVARIANT, PG_TSC_ADJUST we could have

PG_TSC_AVAILABLE /* RDTSCP & INVARIANT */
PG_TSC_KNOWN_RELIABLE /* PG_TSC_AVAILABLE && PG_TSC_ADJUST */
PG_TSC_FREQUENCY_KNOWN /* x86_tsc_frequency_khz works */

and always run all of that during set_x86_features().
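
A rough sketch of what I mean (the derived flag names are the ones above;
set_x86_tsc_features(), pg_cpu_features and how the bits are stored are just
made up for illustration):

static void
set_x86_tsc_features(void)
{
    /* TSC readable and invariant */
    if (x86_feature_available(PG_RDTSCP) &&
        x86_feature_available(PG_TSC_INVARIANT))
        pg_cpu_features |= PG_TSC_AVAILABLE;

    /* additionally, the TSC_ADJUST feature is present (reliability heuristic) */
    if ((pg_cpu_features & PG_TSC_AVAILABLE) &&
        x86_feature_available(PG_TSC_ADJUST))
        pg_cpu_features |= PG_TSC_KNOWN_RELIABLE;

    /* CPUID can tell us the frequency directly */
    if (x86_tsc_frequency_khz() > 0)
        pg_cpu_features |= PG_TSC_FREQUENCY_KNOWN;
}

tsc_detect_frequency() / tsc_use_by_default() would then just test the
derived flags.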

> I did note a bit of a grammar oddity at the end of the comment, fixed that.
>
> > - It may be paranoia, but it seems like tsc_calibrate() should perhaps save
> > the old clock source and restore it at the end?
>
> Sure, seems reasonable. Done.

Cool.

> > - Should pg_initialize_timing() allow repeated initialization? Seems like that
> > would normally be a bug?
>
> I'm trying to recall if restore_backend_variables might have a problem
> if we didn't allow that? (since we call pg_initialize_timing there,
> which I think is due to ordering)

Yea, that'd make it problematic.

> > - It's nice that pg_test_timing shows the frequency. I was thinking it were
> > able to show the result of the calibration, even if we were able to
> > determine the frequency without calibration. That should make it easier to
> > figure out whether the calibration works well.
>
> Added. I've renamed "tsc_calibrate" to "pg_tsc_calibrate_frequency"
> and exported that, to support that.

Nice.

Workstation idle:
TSC frequency in use: 2500000 kHz
TSC frequency from calibration: 2499519 kHz

Busy:
TSC frequency in use: 2500000 kHz
TSC frequency from calibration: 2499262 kHz

Completely overwhelmed (load >1200):
TSC frequency in use: 2500000 kHz
TSC frequency from calibration: 2499405 kHz

That's very much good enough.

> FWIW, this now calibrates twice in pg_test_timing if we're on a system
> that has to use calibration. If we wanted to avoid that, we could
> introduce some kind of flag that indicated the TSC frequency was
> already determined through calibration. Not sure if needed?

I have no concern whatsoever with doing it twice in pg_test_timing.

> > - Wonder if some of the code would look a bit cleaner if timing_tsc_enabled,
> > timing_tsc_frequency_khz were defined regardless of PG_INSTR_TSC_CLOCK.
>
> Yeah, I don't see harm in defining them always, and its easier on the
> eyes. Done. Likewise, I've also made the timing_tsc_frequency_khz in
> BackendParameters defined always.

Nice.

One thing this reminded me of is that pg_set_timing_clock_source() does:

bool
pg_set_timing_clock_source(TimingClockSourceType source)
{
    Assert(timing_initialized);

#if PG_INSTR_TSC_CLOCK
    pg_initialize_timing_tsc();

    switch (source)
    {
        case TIMING_CLOCK_SOURCE_AUTO:
            timing_tsc_enabled = (timing_tsc_frequency_khz > 0) && tsc_use_by_default();
            break;
        case TIMING_CLOCK_SOURCE_SYSTEM:
            timing_tsc_enabled = false;
            break;
        case TIMING_CLOCK_SOURCE_TSC:
            /* Tell caller TSC is not usable */
            if (timing_tsc_frequency_khz <= 0)
                return false;
            timing_tsc_enabled = true;
            break;
    }
#endif

    set_ticks_per_ns();
    timing_clock_source = source;
    return true;
}

Which means that if building without PG_INSTR_TSC_CLOCK and called with
TIMING_CLOCK_SOURCE_TSC, it'd return true despite having done something bogus.
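
One way to close that gap would be something like this in place of the bare
#endif (just a sketch of the idea, not something from the patch):

#else
    /* TSC support not compiled in, an explicit 'tsc' request can't be honored */
    if (source == TIMING_CLOCK_SOURCE_TSC)
        return false;
#endif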

I was also wondering if there is an argument for moving the
pg_initialize_timing_tsc() call into the relevant switch() cases, so the
calibration doesn't run if configured with TIMING_CLOCK_SOURCE_SYSTEM. But I
think that because we'll be called with the builtin value during GUC
initialization, that wouldn't change anything?

> >
> > - is it ok that we are doing pg_cpuid_subleaf() without checking the result?
> >
> > It's not clear to me if a failed __get_cpuid_count() would clear out the old
> > reg or leave it in place.
>
> Hm, maybe best if we just memset reg in pg_cpuid_subleaf
> unconditionally, before calling __get_cpuid_count / __cpuidex.

Yea, I think that'd be safer.
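
I.e. roughly (a sketch; the exact signature and the configure symbols are
from memory / assumed):

void
pg_cpuid_subleaf(unsigned int leaf, unsigned int subleaf, unsigned int *reg)
{
    /* don't leave stale values behind if the query below fails */
    memset(reg, 0, 4 * sizeof(unsigned int));

#if defined(HAVE__GET_CPUID_COUNT)
    __get_cpuid_count(leaf, subleaf, &reg[0], &reg[1], &reg[2], &reg[3]);
#elif defined(HAVE__CPUIDEX)
    __cpuidex((int *) reg, leaf, subleaf);
#endif
}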

> > - How much do we care about weird results when dynamically changing
> > timing_clocksource?
> >
> > postgres[182055][1]=# EXPLAIN ANALYZE SELECT set_config('timing_clock_source', 'tsc', true), pg_sleep(1), set_config('timing_clock_source', 'system', true), pg_sleep(1);
> > ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> > │ QUERY PLAN │
> > ├────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> > │ Result (cost=0.00..0.01 rows=1 width=72) (actual time=-6540570569.396..-6540570569.395 rows=1.00 loops=1) │
> > │ Planning Time: 0.184 ms │
> > │ Execution Time: -6540570569.355 ms │
> > └────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> > (3 rows)
> >
> > Time: 2002.350 ms (00:02.002)
>
> That was brought up earlier in the thread as well, and I added a code
> comment in response.

Should there be a comment in the docs about it?

> I think the trade-off here is that if we make this a more restrictive
> GUC level (the main solution I can think of), we take away the ability
> for users to confirm whether the new timing logic caused their timings
> to be inaccurate.

Yea, it is very useful. I guess an in-between could be to make it SUSET.

Is there an argument that a user could hide the cost of their queries from
things like pg_stat_statements and that therefore it should be SUSET?

> And it seems very unlikely that someone would actually change the GUC within
> a query (or within a function).

I agree. I think it's worth mentioning and worth thinking about whether it
needs to be SUSET (re the question above), but I don't see an argument for
making it PGC_SIGHUP or such.

Not for now, but I think it'd be nice if the GUC framework had a way of
expressing that some settings can only be changed at the top-level.

> > - 'tsc' describes just x86-64, even if there is a patch to support aarch64.
> > Perhaps it'd be enough to sprinkle a few "E.g. on x86-64, ..." around.
>
> Hmm. I'm not sure how we can improve that really by adding "E.g."
> somewhere, but maybe I don't follow.

+ <literal>tsc</literal> (measures timing using the x86-64 Time-Stamp Counter (TSC)
+ by directly executing RDTSC/RDTSCP instructions, see below)

If that instead is something like 'tsc' (measures timing with a CPU
instruction, e.g. using RDTSC/RDTSCP on x86-64), it would not be wrong even
after adding aarch64 support.

> What I could see us doing is explicitly calling out that TSC is not
> supported on other architectures?

Yea, I think it'd be good to mention that.

> Subject: [PATCH v20 1/5] instrumentation: Streamline ticks to nanosecond
> conversion across platforms

> +static inline int64
> +pg_ticks_to_ns(int64 ticks)
> {
> -    LARGE_INTEGER f;
> +#if PG_INSTR_TICKS_TO_NS
> +    int64 ns = 0;
> +
> +    Assert(timing_initialized);
> +
> +    /*
> +     * Avoid doing work if we don't use scaled ticks, e.g. system clock on
> +     * Unix
> +     */

Maybe add something like "(in that case ticks is counted in nanoseconds)"?

Leaving aside that I don't think it makes sense to push this without also
pushing 0002/0003, I think this is ready.

> Subject: [PATCH v20 2/5] Allow retrieving x86 TSC frequency/flags from CPUID
>
> This adds additional x86 specific CPUID checks for flags needed for
> determining whether the Time-Stamp Counter (TSC) is usable on a given
> system, as well as a helper function to retrieve the TSC frequency from
> CPUID.
>
> This is intended for a future patch that will utilize the TSC to lower
> the overhead of timing instrumentation.
>
> In passing, always make pg_cpuid_subleaf reset the variables used for its
> result, to avoid accidentally using stale results if __get_cpuid_count
> errors out.

> +/*
> + * Determine the TSC frequency of the CPU through CPUID, where supported.
> + *
> + * Needed to interpret the tick value returned by RDTSC/RDTSCP. Return value of
> + * 0 indicates the frequency information was not accessible via CPUID.
> + */
> +uint32
> +x86_tsc_frequency_khz(void)
> +{
> +    unsigned int reg[4] = {0};
> +
> +    if (x86_feature_available(PG_HYPERVISOR))
> +        return x86_hypervisor_tsc_frequency_khz();

Is there a point in also checking the things below if the hypervisor-specific
logic doesn't find a freq? I think the TSC info can be configured to be
passed through on some hypervisor / CPU combinations.
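
I.e. roughly (sketch; just falling through instead of returning
unconditionally):

    if (x86_feature_available(PG_HYPERVISOR))
    {
        uint32 freq = x86_hypervisor_tsc_frequency_khz();

        if (freq > 0)
            return freq;
        /* else fall through, the host may expose TSC details to the guest */
    }

    /* ... continue with the regular CPUID leaf based detection ... */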

I think this is also close to ready, except for the minor details I raised at
the start and just here.

> From c58ea726f9c1a89eed46ee45a182457cea737d79 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Thu, 2 Apr 2026 13:17:11 -0700
> Subject: [PATCH v20 3/5] instrumentation: Use Time-Stamp Counter (TSC) on
> x86-64 for faster measurements
>
> This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
> from the CPU using RDTSC/RDTSC instructions, instead of APIs like

Missing P at the end of RDTSC/RDTSC.

> clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
> ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
> reduced by up to 10% for queries moving lots of rows through the plan.

FWIW, I see considerably bigger gains in some cases. Mostly queries with many
query "levels". But even some simple ones:

Baseline:

SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;
\timing reports 322.548 ms

Baseline with EXPLAIN ANALYZE overhead:

EXPLAIN (ANALYZE, BUFFERS 0, TIMING OFF) SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;

QUERY PLAN
Limit (cost=168370.00..168370.02 rows=1 width=97) (actual rows=0.00 loops=1)
   ->  Seq Scan on pgbench_accounts (cost=0.00..168370.00 rows=10000000 width=97) (actual rows=10000000.00 loops=1)
Planning Time: 0.059 ms
Execution Time: 426.570 ms

1.32 x slowdown.

SET timing_clock_source = 'system';
EXPLAIN (ANALYZE, BUFFERS 0) SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;

Limit (cost=168370.00..168370.02 rows=1 width=97) (actual time=882.843..882.843 rows=0.00 loops=1)
   ->  Seq Scan on pgbench_accounts (cost=0.00..168370.00 rows=10000000 width=97) (actual time=0.021..593.587 rows=10000000.00 loops=1)
Planning Time: 0.063 ms
Execution Time: 882.860 ms

2.06 x slowdown relative to TIMING OFF

SET timing_clock_source = 'tsc';
Limit (cost=168370.00..168370.02 rows=1 width=97) (actual time=543.098..543.098 rows=0.00 loops=1)
   ->  Seq Scan on pgbench_accounts (cost=0.00..168370.00 rows=10000000 width=97) (actual time=0.017..413.878 rows=10000000.00 loops=1)
Planning Time: 0.061 ms
Execution Time: 543.122 ms

1.27 x slowdown relative to TIMING OFF

1.63x speedup relative to system.

But I also see ~20% gains for some TPCH queries, for example.

> To control use of the TSC, the new "timing_clock_source" GUC is introduced,
> whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
> in case the system clocksource is reported as "tsc". The use of the system
> APIs can be enforced by setting "system", or on x86-64 architectures the
> use of TSC can be enforced by explicitly setting "tsc".

It's more widely enabled by default now, right?

> In order to use the TSC the frequency is first determined by use of CPUID,
> and if not available, by running a short calibration loop at program start,
> falling back to the system time if TSC values are not stable.
>
> Note, that we split TSC usage into the RDTSC CPU instruction which does not
> wait for out-of-order execution (faster, less precise) and the RDTSCP instruction,
> which waits for outstanding instructions to retire. RDTSCP is deemed to have
> little benefit in the typical InstrStartNode() / InstrStopNode() use case of
> EXPLAIN, and can be up to twice as slow. To separate these use cases, the new
> macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.
>
> The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
> to be used when precision is more important than performance. When the
> system timing clock source is used both of these macros instead utilize
> the system APIs (clock_gettime / QueryPerformanceCounter) like before.

Maybe worth adding that there are other things that may be worth converting,
like track_io_timing/track_wal_io_timing.

> +const char *
> +show_timing_clock_source(void)
> +{
> +    switch (timing_clock_source)
> +    {
> +        case TIMING_CLOCK_SOURCE_AUTO:
> +#if PG_INSTR_TSC_CLOCK
> +            if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
> +                return "auto (tsc)";
> +#endif
> +            return "auto (system)";
> +        case TIMING_CLOCK_SOURCE_SYSTEM:
> +            return "system";
> +#if PG_INSTR_TSC_CLOCK
> +        case TIMING_CLOCK_SOURCE_TSC:
> +            return "tsc";
> +#endif

For a moment I was wondering if we should have this display the frequency and
whether it's calibrated. But I think that's too cute by half.

> +static void
> +set_ticks_per_ns(void)
> +{
> +#if PG_INSTR_TSC_CLOCK
> +    if (timing_tsc_enabled)
> +        set_ticks_per_ns_for_tsc();
> +    else
> +        set_ticks_per_ns_system();
> +#else
> +    set_ticks_per_ns_system();
> +#endif
> +}

How about?

static void
set_ticks_per_ns(void)
{
#if PG_INSTR_TSC_CLOCK
    if (timing_tsc_enabled)
    {
        set_ticks_per_ns_for_tsc();
        return;
    }
#endif
    set_ticks_per_ns_system();
}

> @@ -83,27 +88,90 @@ typedef struct instr_time
> /* Shift amount for fixed-point ticks-to-nanoseconds conversion. */
> #define TICKS_TO_NS_SHIFT 14
>
> -#ifdef WIN32
> -#define PG_INSTR_TICKS_TO_NS 1
> -#else
> -#define PG_INSTR_TICKS_TO_NS 0
> -#endif
> -

I'd add it to the place it'll later be added.

Think this is quite close.

> Subject: [PATCH v20 4/5] pg_test_timing: Also test RDTSC/RDTSCP timing and
> report time source and TSC frequency

> +    /* Now, emit fast timing measurements */
> +    loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
> +    output(loop_count);
> +    printf("\n");
> +
> +    printf(_("TSC frequency in use: %u kHz\n"), timing_tsc_frequency_khz);
> +
> +    calibrated_freq = pg_tsc_calibrate_frequency();
> +    if (calibrated_freq > 0)
> +        printf(_("TSC frequency from calibration: %u kHz\n"), calibrated_freq);
> +    else
> +        printf(_("TSC calibration did not converge\n"));

If this were to also indicate whether the current frequency came from a
non-calibration source, it'd be perfect, but that's definitely not required.
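
E.g. something like this, reusing the "frequency was determined through
calibration" flag you mentioned above (the name is invented here):

    printf(_("TSC frequency in use: %u kHz (%s)\n"),
           timing_tsc_frequency_khz,
           timing_tsc_frequency_calibrated ? "from calibration" : "from CPUID");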

> Subject: [PATCH v20 5/5] instrumentation: ARM support for fast time
> measurements
>
> Similar to the RDTSC/RDTSCP instructions on x68-64, this introduces
> use of the cntvct_el0 instruction on ARM systems to access the generic
> timer that provides a synchronized ticks value across CPUs.
>
> Note this adds an exception for Apple Silicon CPUs, due to the observed
> fact that M3 and newer has different timer frequencies for the Efficiency
> and the Performance cores, and we can't be sure where we get scheduled.
>
> To simplify the implementation this does not support Windows on ARM,
> since its quite rare and hard to test.
>
> Relies on the existing timing_clock_source GUC to control whether
> TSC-like timer gets used, instead of system timer.

> +/*
> + * Check whether this is a heterogeneous Apple Silicon P+E core system
> + * where CNTVCT_EL0 may tick at different rates on different core types.
> + */
> +static bool
> +aarch64_has_heterogeneous_cores(void)
> +{
> +#if defined(__APPLE__)
> +    int nperflevels = 0;
> +    size_t len = sizeof(nperflevels);
> +
> +    if (sysctlbyname("hw.nperflevels", &nperflevels, &len, NULL, 0) == 0)
> +        return nperflevels > 1;
> +#endif
> +
> +    return false;
> +}
> +
> +/*
> + * Detect the generic timer frequency on AArch64.
> + */
> +static void
> +tsc_detect_frequency(void)
> +{
> +    if (aarch64_has_heterogeneous_cores())
> +    {
> +        timing_tsc_frequency_khz = 0;
> +        return;
> +    }
> +
> +    timing_tsc_frequency_khz = aarch64_cntvct_frequency_khz();
> +}

> +/*
> + * The ARM generic timer is architecturally guaranteed to be monotonic and
> + * synchronized across cores of the same type, so we always use it by default
> + * when available and cores are homogenous.
> + */
> +static bool
> +tsc_use_by_default(void)
> +{
> +    return true;
> +}

I'm somewhat sceptical of that being viable, given that we only have support
for detecting heterogeneous cores on macOS. You can e.g. run Linux on M*
hardware. And I wonder if other big.LITTLE heterogeneous architectures have
the same problem...

> +uint32
> +pg_tsc_calibrate_frequency(void)
> +{
> +    /* No calibration loop on AArch64; frequency comes from CNTFRQ_EL0 */
> +    return 0;
> +}

Think I'd advocate for support for that if/when we add ARM support, even if
it's just to be able to verify things are sane via pg_test_timing.
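
Even a crude cross-check against the system clock would do, something like
(untested sketch):

static uint32
aarch64_calibrate_frequency_khz(void)
{
    struct timespec start_ts,
                end_ts;
    uint64      start_ticks,
                end_ticks;
    int64       elapsed_ns;

    clock_gettime(CLOCK_MONOTONIC, &start_ts);
    start_ticks = __builtin_arm_rsr64("cntvct_el0");

    /* spin for ~50ms of wall clock time */
    do
    {
        clock_gettime(CLOCK_MONOTONIC, &end_ts);
        elapsed_ns = (end_ts.tv_sec - start_ts.tv_sec) * INT64CONST(1000000000) +
            (end_ts.tv_nsec - start_ts.tv_nsec);
    } while (elapsed_ns < 50 * INT64CONST(1000000));

    end_ticks = __builtin_arm_rsr64("cntvct_el0");

    /* ticks per millisecond is kHz */
    return (uint32) ((end_ticks - start_ticks) * INT64CONST(1000000) / elapsed_ns);
}

which pg_test_timing could then print next to CNTFRQ_EL0.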

> @@ -144,7 +150,6 @@ extern bool pg_set_timing_clock_source(TimingClockSourceType source);
> #define PG_INSTR_TICKS_TO_NS 0
> #endif
>
> -
> /* Whether to actually use TSC based on availability and GUC settings. */
> extern PGDLLIMPORT bool timing_tsc_enabled;
>

Spurious line change.

> +#elif defined(__aarch64__) && !defined(WIN32)
> +
> +/*
> + * Read the ARM generic timer counter (CNTVCT_EL0).
> + *
> + * The "fast" variant reads the counter without a barrier, analogous to RDTSC
> + * on x86. The regular variant issues an ISB (Instruction Synchronization
> + * Barrier) first, which acts as a serializing instruction analogous to RDTSCP,
> + * ensuring all preceding instructions have completed before reading the
> + * counter.
> + */
> +static inline instr_time
> +pg_get_ticks_fast(void)
> +{
> +    if (likely(timing_tsc_enabled))
> +    {
> +        instr_time now;
> +
> +        now.ticks = __builtin_arm_rsr64("cntvct_el0");
> +        return now;

Seems like this is about !msvc (or rather a gcc-like compiler), rather than
about Windows?
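
I.e. the guard could key on the compiler instead, roughly (sketch;
read_cntvct is just a stand-in name, and the same register can also be read
via inline asm on any gcc-compatible compiler):

#elif defined(__aarch64__) && defined(__GNUC__)

static inline uint64
read_cntvct(void)
{
    uint64      val;

    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r" (val));
    return val;
}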

Greetings,

Andres Freund
