Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: John Naylor <johncnaylorls(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2026-04-06 11:25:36
Message-ID: CAP53PkyQbtXrEeBZiM1Zrp4QmKd2DH2oWS9ryd+DGvSu=NnVDA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 5, 2026 at 10:15 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> - tsc_use_by_default() may be documenting things that aren't the case anymore
>

I don't think there is a correctness issue, unless you mean the fact
that we're no longer doing the < 8 socket check, which is one of the
things mentioned in the referenced LKML posts. I think the LKML
references are still helpful as evidence for why we trust Intel to
have reliable TSC handling.

I did note a bit of a grammar oddity at the end of the comment, fixed that.

> - It may be paranoia, but it seems like tsc_calibrate() should perhaps save
> the old clock source and restore it at the end?

Sure, seems reasonable. Done.

>
> - A brief comment in the source (rather than just the thread), why
> TSC_CALIBRATION_SKIPS exists would be nice
>

Added. I've also split the two different kinds of skipping into their
own if conditions for clarity.

>
> - Should pg_initialize_timing() allow repeated initialization? Seems like that
> would normally be a bug?

I'm trying to recall whether restore_backend_variables might have a
problem if we didn't allow that (since we call pg_initialize_timing
there, which I think is due to ordering).

>
> - It's nice that pg_test_timing shows the frequency. I was thinking it were
> able to show the result of the calibration, even if we were able to
> determine the frequency without calibration. That should make it easier to
> figure out whether the calibration works well.

Added. To support this, I've renamed "tsc_calibrate" to
"pg_tsc_calibrate_frequency" and exported it.

FWIW, this now calibrates twice in pg_test_timing if we're on a system
that has to use calibration. If we wanted to avoid that, we could
introduce some kind of flag that indicates the TSC frequency was
already determined through calibration. Not sure if that's needed?

>
> - Wonder if some of the code would look a bit cleaner if timing_tsc_enabled,
> timing_tsc_frequency_khz were defined regardless of PG_INSTR_TSC_CLOCK.

Yeah, I don't see harm in defining them always, and it's easier on the
eyes. Done. Likewise, I've also made timing_tsc_frequency_khz in
BackendParameters always defined.

I've left pg_initialize_timing_tsc gated by PG_INSTR_TSC_CLOCK for
now, but we could also consider making it always exist and simply do
nothing in that case.

>
> - is it ok that we are doing pg_cpuid_subleaf() without checking the result?
>
> It's not clear to me if a failed __get_cpuid_count() would clear out the old
> reg or leave it in place.

Hm, maybe best if we just memset reg unconditionally in
pg_cpuid_subleaf, before calling __get_cpuid_count / __cpuidex.

With that, our subsequent checks are guaranteed to see zero values in
case the call failed.

> - I don't think the elog() in pg_initialize_timing_tsc() works in any common
> scenario, as log_min_messages hasn't been initialized yet, so that message
> won't ever be emitted.
>
> If it used a different log level, it'd be emitted without knowing the
> log_line_prefix etc, because that hasn't yet been initialized.

Ah. Hmm. If I follow correctly, that's because it gets called from a
GUC check hook, and log_min_messages will only get initialized later.

I've removed it again for now.

>
> - How much do we care about weird results when dynamically changing
> timing_clocksource?
>
> postgres[182055][1]=# EXPLAIN ANALYZE SELECT set_config('timing_clock_source', 'tsc', true), pg_sleep(1), set_config('timing_clock_source', 'system', true), pg_sleep(1);
> ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN │
> ├────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Result (cost=0.00..0.01 rows=1 width=72) (actual time=-6540570569.396..-6540570569.395 rows=1.00 loops=1) │
> │ Planning Time: 0.184 ms │
> │ Execution Time: -6540570569.355 ms │
> └────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (3 rows)
>
> Time: 2002.350 ms (00:02.002)

That was brought up earlier in the thread as well, and I added a code
comment in response.

I think the trade-off here is that if we make this a more restrictive
GUC level (the main solution I can think of), we take away the ability
for users to confirm whether the new timing logic caused their timings
to be inaccurate. And it seems very unlikely that someone would
actually change the GUC within a query (or within a function).

That said, if we think that's a problem, we should just raise the GUC
level. I don't think it'd break anything, it's just less convenient.

>
>
> - SGML:
>
> - "modern CPUs" seems a bit pejorative, at least if we don't merge the
> aarch64 stuff.

I've revised this to "supported x86-64 CPUs".

> - 'tsc' describes just x86-64, even if there is a patch to support aarch64.
> Perhaps it'd be enough to sprinkle a few "E.g. on x86-64, ..." around.

Hmm. I'm not sure how adding "E.g." somewhere would really improve
that, but maybe I don't follow.

What I could see us doing is explicitly calling out that TSC is not
supported on other architectures?

>
> - I wonder if it'd be useful to have a INSTR_TIME_SET_CURRENT_SYSTEM() that'd
> always use the "system" method, but convert it to to ticks if appropriate.
>
> E.g. test_tsc_timing() could measure the elapsed time with system and
> display the difference in how long the test_timing() takes between tsc based
> source and system time.
>
> This might be more of a thing for later.

Sure, I think if the conversion happened right away (i.e. we keep the
stored internal ticks value consistent with the active source) that
would be safe to do. I agree we can defer this until later.

---

Attached v20 with feedback addressed.

Thanks,
Lukas

--
Lukas Fittl

Attachment Content-Type Size
v20-0005-instrumentation-ARM-support-for-fast-time-measur.patch application/octet-stream 8.3 KB
v20-0002-Allow-retrieving-x86-TSC-frequency-flags-from-CP.patch application/octet-stream 7.3 KB
v20-0001-instrumentation-Streamline-ticks-to-nanosecond-c.patch application/octet-stream 16.6 KB
v20-0003-instrumentation-Use-Time-Stamp-Counter-TSC-on-x8.patch application/octet-stream 30.9 KB
v20-0004-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and.patch application/octet-stream 7.4 KB
