Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2026-04-07 03:41:46
Message-ID: CAP53PkwuL5vpURvs9ks-3CeK-M3ZJCDZZ0hnxHryfSadLE7h5g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 6, 2026 at 5:40 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-04-06 04:25:36 -0700, Lukas Fittl wrote:
> > On Sun, Apr 5, 2026 at 10:15 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > - tsc_use_by_default() may be documenting things that aren't the case anymore
> > >
> >
> > I don't think there is a correctness issue, unless you mean the fact
> > that we're not doing the < 8 socket check anymore that is one of the
> > things mentioned on the LKML posts referenced. I think the LKML post
> > references are still helpful as evidence why we trust that Intel has
> > reliable TSC handling.
>
> Well:
> "Mirrors the Linux kernel's clocksource watchdog disable logic"
> doesn't seem quite right, given that in that place we are just checking
> TSC_ADJUST and we don't have the < 8 socket check.
>
> I'd probably say something like 'inspired by ... ' and mention that the rest
> of the check is in tsc_detect_frequency().

Yeah, that makes sense. Reworded.

> I wonder if the cpuid tests should be a bit further abstracted into
> pg_cpu_x86.c.
>
> E.g. instead of tsc_detect_frequency() checking for PG_RDTSCP,
> PG_TSC_INVARIANT, PG_TSC_ADJUST we could have
>
> PG_TSC_AVAILABLE /* RDTSCP & INVARIANT */
> PG_TSC_KNOWN_RELIABLE /* PG_TSC_AVAILABLE && PG_TSC_ADJUST */
> PG_TSC_FREQUENCY_KNOWN /* x86_tsc_frequency_khz works */
>
> and always run all of that during set_x86_features().

I think that could work, but I kept the feature flags as close to direct
mappings of CPUID bits as possible, since that seemed to be the intent of
how John designed the facility originally.

John, do you have thoughts on this? (I've not changed it for now)

FWIW, I don't think having PG_TSC_KNOWN_RELIABLE makes sense in any
case, because it would tie x86_tsc_frequency_khz and set_x86_features
together: you'd either have the frequency function modify X86Features
after the fact, or always run x86_tsc_frequency_khz when setting
features (which would then require storing the frequency value
somewhere, etc.)

> > > - It's nice that pg_test_timing shows the frequency. I was thinking it would
> > >   be nice if it were able to show the result of the calibration, even if we were
> > >   able to determine the frequency without calibration. That should make it easier to
> > >   figure out whether the calibration works well.
> >
> > Added. I've renamed "tsc_calibrate" to "pg_tsc_calibrate_frequency"
> > and exported it, to support that.
>
> Nice.
>
> Workstation idle:
> TSC frequency in use: 2500000 kHz
> TSC frequency from calibration: 2499519 kHz
>
> Busy:
> TSC frequency in use: 2500000 kHz
> TSC frequency from calibration: 2499262 kHz
>
> Completely overwhelmed (load >1200):
> TSC frequency in use: 2500000 kHz
> TSC frequency from calibration: 2499405 kHz
>
> That's very much good enough.
>

Nice, thanks for confirming!

> > > - Wonder if some of the code would look a bit cleaner if timing_tsc_enabled,
> > > timing_tsc_frequency_khz were defined regardless of PG_INSTR_TSC_CLOCK.
> >
> > Yeah, I don't see harm in defining them always, and it's easier on the
> > eyes. Done. Likewise, I've made the timing_tsc_frequency_khz in
> > BackendParameters always defined.
>
> Nice.
>
> One thing this reminded me of is that pg_set_timing_clock_source() does:
>
> bool
> pg_set_timing_clock_source(TimingClockSourceType source)
> {
> Assert(timing_initialized);
>
> #if PG_INSTR_TSC_CLOCK
> pg_initialize_timing_tsc();
>
> switch (source)
> {
> case TIMING_CLOCK_SOURCE_AUTO:
> timing_tsc_enabled = (timing_tsc_frequency_khz > 0) && tsc_use_by_default();
> break;
> case TIMING_CLOCK_SOURCE_SYSTEM:
> timing_tsc_enabled = false;
> break;
> case TIMING_CLOCK_SOURCE_TSC:
> /* Tell caller TSC is not usable */
> if (timing_tsc_frequency_khz <= 0)
> return false;
> timing_tsc_enabled = true;
> break;
> }
> #endif
>
> set_ticks_per_ns();
> timing_clock_source = source;
> return true;
> }
>
>
> Which means that if building without PG_INSTR_TSC_CLOCK and called with
> TIMING_CLOCK_SOURCE_TSC, it'd return true despite having done something bogus.

Right, it is slightly odd that we allow that.

I think the easiest way to make that very clear is to hide the
existence of the TIMING_CLOCK_SOURCE_TSC enum value behind a
PG_INSTR_TSC_CLOCK gate. That seems permissible for this kind of
enum: we already do the same for the GUC value anyway, and there were
no references to that value that weren't already behind a
PG_INSTR_TSC_CLOCK check.

Done that way.

That also required moving some of the #defines in instr_time.h around
so they are available when TimingClockSourceType is defined. I've
opted to define PG_INSTR_TSC_CLOCK/PG_INSTR_TICKS_TO_NS early in the
file, but kept the clock source names consistently next to the code
that implements pg_get_ticks. I think that flows nicely now.

>
> I was also wondering if there is an argument for moving the
> pg_initialize_timing_tsc() into the relevant switch() cases, so the
> calibration doesn't run if configured with TIMING_CLOCK_SOURCE_SYSTEM. But I
> think because during GUC initialization we'll be called with the builtin
> value, that wouldn't change anything?

Yeah, unless the default were something other than "auto", we can't
get out of having to run "pg_initialize_timing_tsc". I think moving
pg_initialize_timing_tsc into the individual switch cases makes the
code more verbose without saving much, so I've left this as is for
now.

>
>
> > >
> > > - is it ok that we are doing pg_cpuid_subleaf() without checking the result?
> > >
> > > It's not clear to me if a failed __get_cpuid_count() would clear out the old
> > > reg or leave it in place.
> >
> > Hm, maybe best if we just memset reg in pg_cpuid_subleaf
> > unconditionally, before calling __get_cpuid_count / __cpuidex.
>
> Yea, I think that'd be safer.

FWIW, done that way in the v20 version already, I just didn't say that clearly.

>
> > > - How much do we care about weird results when dynamically changing
> > > timing_clocksource?
> > >
> > > postgres[182055][1]=# EXPLAIN ANALYZE SELECT set_config('timing_clock_source', 'tsc', true), pg_sleep(1), set_config('timing_clock_source', 'system', true), pg_sleep(1);
> > > ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> > > │ QUERY PLAN │
> > > ├────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> > > │ Result (cost=0.00..0.01 rows=1 width=72) (actual time=-6540570569.396..-6540570569.395 rows=1.00 loops=1) │
> > > │ Planning Time: 0.184 ms │
> > > │ Execution Time: -6540570569.355 ms │
> > > └────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> > > (3 rows)
> > >
> > > Time: 2002.350 ms (00:02.002)
> >
> > That was brought up earlier in the thread as well, and I added a code
> > comment in response.
>
> Should there be a comment in the docs about it?

Let's do it. I've added the following note to the docs:

Changing the setting during query execution is not recommended
and may cause interval timings to jump significantly or produce negative values.

>
> > I think the trade-off here is that if we make this a more restrictive
> > GUC level (the main solution I can think of), we take away the ability
> > for users to confirm whether the new timing logic caused their timings
> > to be inaccurate.
>
> Yea, it is very useful. I guess an inbetween could be to make it SUSET.
>
> Is there an argument that a user could hide the cost of their queries from
> things like pg_stat_statements and that therefore it should be SUSET?

Yeah, that's a valid point - per your test it could allow unprivileged
users to pretend their expensive operations are cheap, or to
intentionally corrupt aggregated timings. And the person testing
timing sources is likely an operator anyway (i.e. this is not
something an application is meant to change), so we're not losing much
value.

I've adjusted it to SUSET and added a note in the docs that it's superuser-only.

>
> > And it seems very unlikely that someone would actually change the GUC within
> > a query (or within a function).
>
> I agree. I think it's worth mentioning and worth thinking about whether it
> needs to be SUSET (re the question above), but I don't see an argument for
> making it PGC_SIGHUP or such.
>
> Not for now, but I think it'd be nice if the GUC framework had a way of
> expressing that some settings can only be changed at the top-level.

Agreed, that'd be useful to have.

>
> > > - 'tsc' describes just x86-64, even if there is a patch to support aarch64.
> > > Perhaps it'd be enough to sprinkle a few "E.g. on x86-64, ..." around.
> >
> > Hmm. I'm not sure how we can improve that really by adding "E.g."
> > somewhere, but maybe I don't follow.
>
> + <literal>tsc</literal> (measures timing using the x86-64 Time-Stamp Counter (TSC)
> + by directly executing RDTSC/RDTSCP instructions, see below)
>
> If that instead is something like 'tsc' (measures timing with a CPU
> instruction, e.g. using RDTSC/RDTSCP on x86-64) it would not be wrong even
> after adding aarch64 support.
>

Thanks, that is a good rewording - done.

>
> > What I could see us doing is explicitly calling out that TSC is not
> > supported on other architectures?
>
> Yea, I think it'd be good to mention that.
>

I've gone ahead and rewritten that whole paragraph for clarity, and
also split it into two. Feedback welcome:

<para>
If enabled, the TSC clock source will use specialized CPU instructions
when measuring time intervals. This lowers timing overhead compared to
reading the OS system clock, and reduces the measurement error on top
of the actual runtime, for example with EXPLAIN ANALYZE.
</para>
<para>
On x86-64 CPUs the TSC clock source utilizes the Time-Stamp Counter (TSC)
of the CPU. The RDTSC instruction is used to read the TSC for EXPLAIN ANALYZE.
For timings that require higher precision the RDTSCP instruction is used,
which avoids inaccuracies due to CPU instruction re-ordering. Use of
RDTSC/RDTSCP is not supported on older x86-64 CPUs or hypervisors that don't
pass the TSC frequency to guest VMs, and is not advised on systems that
utilize an emulated TSC. The TSC clock source is currently not supported on
other architectures.
</para>
<para>
To help decide which clock source to use, you can run the
<application>pg_test_timing</application>
utility to check TSC availability and perform timing measurements.
</para>

>
> > Subject: [PATCH v20 1/5] instrumentation: Streamline ticks to nanosecond
> > conversion across platforms
>
> ...
>
> Leaving aside that I don't think it makes sense to push this without also
> pushing 0002/0003, I think this is ready.
>

Great!

>
> > Subject: [PATCH v20 2/5] Allow retrieving x86 TSC frequency/flags from CPUID
> >
>
> > +/*
> > + * Determine the TSC frequency of the CPU through CPUID, where supported.
> > + *
> > + * Needed to interpret the tick value returned by RDTSC/RDTSCP. Return value of
> > + * 0 indicates the frequency information was not accessible via CPUID.
> > + */
> > +uint32
> > +x86_tsc_frequency_khz(void)
> > +{
> > + unsigned int reg[4] = {0};
> > +
> > + if (x86_feature_available(PG_HYPERVISOR))
> > + return x86_hypervisor_tsc_frequency_khz();
>
>
> Is there a point in checking whether the things below are present if the
> hypervisor specific logic doesn't find a freq? I think it can be configured
> to be passed through on some hypervisor / cpu combinations.
>

Yeah, when I wrote that I pondered the same question. I don't see harm
in falling back to the below if the hypervisor frequency is 0.
Adjusted that way.

>
> I think this is also close to ready, except for the minor details I raised at
> the start and just here.
>

Great. Thank you for the thorough review, as always :)

> > clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
> > ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
> > reduced by up to 10% for queries moving lots of rows through the plan.
>
> FWIW, I see considerably bigger gains in some cases. Mostly queries with many
> query "levels". But even some simple ones:
>
>
> Baseline:
>
> SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;
> \timing reports 322.548 ms
>
> Baseline with EXPLAIN ANALYZE overhead:
>
>
> EXPLAIN (ANALYZE, BUFFERS 0, TIMING OFF) SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;
>
> QUERY PLAN
> Limit (cost=168370.00..168370.02 rows=1 width=97) (actual rows=0.00 loops=1)
> -> Seq Scan on pgbench_accounts (cost=0.00..168370.00 rows=10000000 width=97) (actual rows=10000000.00 loops=1)
> Planning Time: 0.059 ms
> Execution Time: 426.570 ms
>
> 1.32 x slowdown.
>
>
> SET timing_clock_source = 'system';
> EXPLAIN (ANALYZE, BUFFERS 0) SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;
>
> Limit (cost=168370.00..168370.02 rows=1 width=97) (actual time=882.843..882.843 rows=0.00 loops=1)
> -> Seq Scan on pgbench_accounts (cost=0.00..168370.00 rows=10000000 width=97) (actual time=0.021..593.587 rows=10000000.00 loops=1)
> Planning Time: 0.063 ms
> Execution Time: 882.860 ms
>
> 2.06 x slowdown relative to TIMING OFF
>
>
> SET timing_clock_source = 'tsc';
> Limit (cost=168370.00..168370.02 rows=1 width=97) (actual time=543.098..543.098 rows=0.00 loops=1)
> -> Seq Scan on pgbench_accounts (cost=0.00..168370.00 rows=10000000 width=97) (actual time=0.017..413.878 rows=10000000.00 loops=1)
> Planning Time: 0.061 ms
> Execution Time: 543.122 ms
>
> 1.27 x slowdown relative to TIMING OFF
>
> 1.63x speedup relative to system.
>
>
> But I also see ~20% gains for some TPCH queries, for example.

Thanks for some fresh numbers! I've reworded this now as:

This reduces the overhead of EXPLAIN with ANALYZE and TIMING ON. Tests showed
that for queries moving lots of rows through the plan, the instrumented
runtime can drop from about 2x the uninstrumented runtime to about 1.2x. More
complex workloads such as TPCH queries have also shown ~20% gains when
instrumented compared to before.

>
>
> > To control use of the TSC, the new "timing_clock_source" GUC is introduced,
> > whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
> > in case the system clocksource is reported as "tsc". The use of the system
> > APIs can be enforced by setting "system", or on x86-64 architectures the
> > use of TSC can be enforced by explicitly setting "tsc".
>
> It's more widely enabled by default now, right?

Reworded as:

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when reliable, for example
when running on modern Intel CPUs, or when running on Linux and the system
clocksource is reported as "tsc". The use of the operating system clock
source can be enforced by setting "system", or on x86-64 architectures
the use of TSC can be enforced by explicitly setting "tsc".

>
> > In order to use the TSC the frequency is first determined by use of CPUID,
> > and if not available, by running a short calibration loop at program start,
> > falling back to the system time if TSC values are not stable.
> >
> > Note, that we split TSC usage into the RDTSC CPU instruction which does not
> > wait for out-of-order execution (faster, less precise) and the RDTSCP instruction,
> > which waits for outstanding instructions to retire. RDTSCP is deemed to have
> > little benefit in the typical InstrStartNode() / InstrStopNode() use case of
> > EXPLAIN, and can be up to twice as slow. To separate these use cases, the new
> > macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.
> >
> > The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
> > to be used when precision is more important than performance. When the
> > system timing clock source is used both of these macros instead utilize
> > the system APIs (clock_gettime / QueryPerformanceCounter) like before.
>
> Maybe worth adding that there are other things that may be worth converting,
> like track_io_timing/track_wal_io_timing.

I've added the following:

Additional users of interval timing, such as track_io_timing and
track_wal_io_timing could also benefit from being converted to use
INSTR_TIME_SET_CURRENT_FAST() but are left for a future change.

>
> > +const char *
> > +show_timing_clock_source(void)
> > +{
> > + switch (timing_clock_source)
> > + {
> > + case TIMING_CLOCK_SOURCE_AUTO:
> > +#if PG_INSTR_TSC_CLOCK
> > + if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
> > + return "auto (tsc)";
> > +#endif
> > + return "auto (system)";
> > + case TIMING_CLOCK_SOURCE_SYSTEM:
> > + return "system";
> > +#if PG_INSTR_TSC_CLOCK
> > + case TIMING_CLOCK_SOURCE_TSC:
> > + return "tsc";
> > +#endif
>
> For a moment I was wondering if we should have this display the frequency and
> whether it's calibrated. But I think that's too cute by half.
>

Yeah, I initially felt unsure whether the "auto (...)"
mechanism was a bit too novel. I don't think we should put the
frequency/calibration status into the show hook.

>
> > @@ -83,27 +88,90 @@ typedef struct instr_time
> > /* Shift amount for fixed-point ticks-to-nanoseconds conversion. */
> > #define TICKS_TO_NS_SHIFT 14
> >
> > -#ifdef WIN32
> > -#define PG_INSTR_TICKS_TO_NS 1
> > -#else
> > -#define PG_INSTR_TICKS_TO_NS 0
> > -#endif
> > -
>
> I'd add it to the place it'll later be added.

Addressed this by keeping it below TICKS_TO_NS_SHIFT always, and
setting PG_INSTR_TSC_CLOCK in the same spot as well, per the earlier
note of that being needed for the enum definition. I've also added a
comment explaining what these get used for.

> > Subject: [PATCH v20 4/5] pg_test_timing: Also test RDTSC/RDTSCP timing and
> > report time source and TSC frequency
>
>
> > + /* Now, emit fast timing measurements */
> > + loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
> > + output(loop_count);
> > + printf("\n");
> > +
> > + printf(_("TSC frequency in use: %u kHz\n"), timing_tsc_frequency_khz);
> > +
> > + calibrated_freq = pg_tsc_calibrate_frequency();
> > + if (calibrated_freq > 0)
> > + printf(_("TSC frequency from calibration: %u kHz\n"), calibrated_freq);
> > + else
> > + printf(_("TSC calibration did not converge\n"));
>
> If this were to indicate if the current frequency were from a non-calibration
> source it'd be perfect, but that's definitely not required.
>

I mostly didn't want to add yet another variable to keep that
information (and per my note above, I don't think it makes sense to
put it in the x86 features themselves). I've left this as-is for now.
If people are unsure whether their CPUID gives the necessary
information, they can always run a separate utility like "cpuid" to confirm.

>
> > Subject: [PATCH v20 5/5] instrumentation: ARM support for fast time
> > measurements
> >
> ...
>
> > +/*
> > + * The ARM generic timer is architecturally guaranteed to be monotonic and
> > + * synchronized across cores of the same type, so we always use it by default
> > + * when available and cores are homogenous.
> > + */
> > +static bool
> > +tsc_use_by_default(void)
> > +{
> > + return true;
> > +}
>
> I'm somewhat sceptical of that being viable, given that we only have support
> for detecting heterogeneous cores on macOS. You e.g. can run Linux on M*
> hardware. And I wonder if other big.LITTLE heterogeneous architectures have
> the same problem...

Ack. I think we could make this dependent on "were we able to
determine that it's a homogeneous architecture", and default to off if
we weren't.

My understanding is that Apple did something funny here with Apple
Silicon on M3+ (specifically with how they did the timer handling for
the different core types), but I'm not enough of an expert on the ARM
architecture to claim it won't be a problem on other ARM platforms.

>
> > +uint32
> > +pg_tsc_calibrate_frequency(void)
> > +{
> > + /* No calibration loop on AArch64; frequency comes from CNTFRQ_EL0 */
> > + return 0;
> > +}
>
> Think I'd advocate for support for that if/when we add ARM support, even if
> it's just to be able to verify things are sane via pg_test_timing.

We could, but the timer on ARM is actually more trustworthy in the
sense that the frequency is always fixed. At least, it was until
modern ARM raised that fixed value and Apple decided to mix the two
variants.

But there is nothing in the TSC frequency calibration that would
prevent using ARM instructions instead.

---

See attached v21.

I've also marked pg_get_ticks(_fast) as pg_attribute_always_inline,
per an off-list comment from Andres that he observed GCC not fully
inlining that function in pg_test_timing, presumably due to the
likely(..) in it.

Additionally, I reworded the 0001 commit as "instrumentation:
Standardize ticks to nanosecond conversion method" (instead of
"Streamline ticks to nanosecond conversion across platforms"), I think
that's a better headline of what's happening, since in that commit
itself we're still only doing ticks to nanoseconds on Windows.

Thanks,
Lukas

--
Lukas Fittl

Attachment Content-Type Size
v21-0005-instrumentation-ARM-support-for-fast-time-measur.patch application/octet-stream 8.1 KB
v21-0002-Allow-retrieving-x86-TSC-frequency-flags-from-CP.patch application/octet-stream 7.3 KB
v21-0003-instrumentation-Use-Time-Stamp-Counter-TSC-on-x8.patch application/octet-stream 31.9 KB
v21-0001-instrumentation-Standardize-ticks-to-nanosecond-.patch application/octet-stream 16.8 KB
v21-0004-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and.patch application/octet-stream 7.4 KB
