| From: | Lukas Fittl <lukas(at)fittl(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | John Naylor <johncnaylorls(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com> |
| Subject: | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |
| Date: | 2026-03-11 09:11:17 |
| Message-ID: | CAP53PkxBr6HChb3LxpuPEgiBRPcsRdTHXCbZuRLmyusG-NMXFA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
Attached v11, with the following changes:
0001 is a new patch that implements the refactorings of the CPUID code
suggested by John.
0002 is the existing patch that adds support for using __cpuidex
directly (needed by the TSC hypervisor frequency code).
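For reference, the CPUID-based frequency path can be sketched roughly like this (a hedged illustration, not the patch's actual code: the function name and structure are mine, and returning 0 is where the hypervisor leaf or the calibration loop would have to take over):

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>				/* GCC/Clang; MSVC uses __cpuidex from intrin.h */
#endif

/*
 * Sketch: derive the TSC frequency from CPUID leaf 0x15, where EAX/EBX
 * give the TSC/crystal clock ratio and ECX the crystal frequency in Hz.
 * Returns 0 if unavailable (common on older CPUs and in many VMs).
 */
static uint64_t
tsc_freq_from_cpuid(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax = 0,
				ebx = 0,
				ecx = 0,
				edx = 0;

	if (!__get_cpuid_count(0x15, 0, &eax, &ebx, &ecx, &edx))
		return 0;				/* leaf 0x15 not supported */
	if (eax == 0 || ebx == 0 || ecx == 0)
		return 0;				/* CPU doesn't enumerate the crystal clock */

	/* TSC Hz = crystal Hz * (numerator / denominator) */
	return (uint64_t) ecx * ebx / eax;
#else
	return 0;					/* non-x86 */
#endif
}
```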
0003 is the existing patch as before to optimize pg_test_timing.
0004 is almost identical to the previous patch (v10/0003) that adds
the ticks-to-nanoseconds conversion, with a small improvement: an
explicit define (PG_INSTR_TICKS_TO_NS) now controls whether we enter
the complex pg_ticks_to_ns logic at all.
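To illustrate the general shape of such a conversion, here is a hedged fixed-point sketch. PG_INSTR_TICKS_TO_NS is the define mentioned above; every other name is illustrative, not the patch's actual code:

```c
#include <stdint.h>

/*
 * Sketch of a ticks->nanoseconds conversion behind a compile-time define.
 * Precompute a fixed-point multiplier once, so the hot path is a single
 * 128-bit multiply and shift instead of a division.
 */
#define PG_INSTR_TICKS_TO_NS 1

static uint64_t ticks_to_ns_mult;
static const int ticks_to_ns_shift = 32;

/* Once at startup: mult = (10^9 << shift) / tick frequency in Hz */
static void
ticks_to_ns_init(uint64_t tick_hz)
{
	ticks_to_ns_mult = (uint64_t)
		(((__uint128_t) 1000000000 << ticks_to_ns_shift) / tick_hz);
}

static inline uint64_t
ticks_to_ns(uint64_t ticks)
{
#ifdef PG_INSTR_TICKS_TO_NS
	/* 128-bit intermediate avoids overflow for any plausible tick count */
	return (uint64_t) (((__uint128_t) ticks * ticks_to_ns_mult)
					   >> ticks_to_ns_shift);
#else
	return ticks;				/* clock source already reports nanoseconds */
#endif
}
```

With a 3 GHz tick frequency, 3 billion ticks convert to (almost exactly) one billion nanoseconds, off by at most a few ns of fixed-point rounding.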
0005 is the TSC patch with the following changes:
- Dropped the Hyper-V MSR read again, per Andres' feedback
- Added a TSC calibration loop that is used if we can't get the
frequency from CPUID. This is based on a script that Andres shared
off-list, and works both on Hyper-V and on my bare-metal AMD CPU.
Note that we don't use this on Windows, to avoid the calibration's
convergence delay (< 50ms, typically less than 1ms) penalizing
connection start (since without fork, backends don't inherit the TSC
frequency global from the postmaster)
- Moved the GUC logic to instrument.c, because we shouldn't be
defining GUCs in a file that's built with front-end programs
It's worth noting that I have not yet included a way to pass debug
information back to the user (e.g. when the TSC calibration didn't
converge, the TSC is not invariant, etc.), as Jakub suggested
previously. With the TSC calibration code in the picture I'm less
sure it's really needed, since e.g. looking at the "cpuid" program's
output will tell you whether calibration runs or not, and you could
then infer that calibration failed if pg_test_timing doesn't report a
usable TSC.
0006 is the existing patch as before to add pg_test_timing debug output.
0007 is a new patch that shows how we could expand this to also cover
ARM, by reading CNTVCT_EL0. I'm mainly adding this because I think
it's the main evolution of this work that we haven't talked about
much yet, and even if we do it in a later release cycle it'll help
refine the design. This worked as expected for me on an AWS Graviton
instance, but failed on an Apple Silicon M3 due to quirks with its
Efficiency vs. Performance cores - dealt with in the patch by not
using the generic timer directly when we're on a heterogeneous core
system.
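For readers unfamiliar with the ARM side, reading the ARMv8 generic timer looks roughly like the sketch below (hedged illustration, not the patch's code): CNTVCT_EL0 is the virtual counter and CNTFRQ_EL0 its frequency in Hz, both readable from userspace on Linux. Guarded so it compiles (and returns 0) on other architectures:

```c
#include <stdint.h>

/* Read the ARMv8 virtual counter; the ISB prevents the read from being
 * speculated ahead of preceding instructions. Returns 0 off-ARM. */
static inline uint64_t
read_cntvct(void)
{
#if defined(__aarch64__)
	uint64_t	val;

	__asm__ volatile ("isb; mrs %0, cntvct_el0" : "=r" (val));
	return val;
#else
	return 0;
#endif
}

/* Read the counter frequency in Hz (commonly 24 MHz on older SoCs,
 * 1 GHz on newer ones). Returns 0 off-ARM. */
static inline uint64_t
read_cntfrq(void)
{
#if defined(__aarch64__)
	uint64_t	val;

	__asm__ volatile ("mrs %0, cntfrq_el0" : "=r" (val));
	return val;
#else
	return 0;
#endif
}
```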
Looking at how an ARM implementation could work does make me wonder
one thing in general: maybe we shouldn't be using the term "tsc" for
"timing_clock_source" (and internal defines), since "TSC" is an x86
term that won't really make sense once we expand this to ARM in a
future release. Maybe we should use a generic name, like "hwtimer",
"direct" or "hardware"?
On Sun, Mar 8, 2026 at 9:39 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Alternatively, we could consider doing it like the Kernel does it for
> > its calibration loop, and wait 1 second of wall time, and then see how
> > far the TSC counter has advanced.
>
> Yea, I think we need a calibration loop, unfortunately. But I think it should
> be doable to make it a lot quicker than waiting one second. I'm thinking of
> something like a loop that measures the clock cycles and relative time
> (using clock_gettime()) since the start and does so until the frequency
> estimate predicts the time results closely. I think it should be a few 10s of
> milliseconds at most.
That is implemented now in 0005, based on the script you shared
off-list, which I hopefully translated correctly into the Postgres
source. The one part I wasn't sure about is whether we want to use
RDTSCP for the calibration, or RDTSC+LFENCE like you had in your
script (which is closer to what the Abseil library I mentioned
upthread does) - for now I went with RDTSCP to keep it simple.
We run up to 1 million RDTSCP instructions, for at most 50ms, and
terminate once the frequency estimate stays stable for at least 3
iterations. In practice this converges quickly (< 1ms) and closely
matches the TSC frequency reported by the Linux kernel.
Thanks,
Lukas
--
Lukas Fittl
| Attachment | Content-Type | Size |
|---|---|---|
| v11-0001-Refactor-handling-of-x86-CPUID-instructions.patch | application/octet-stream | 3.9 KB |
| v11-0002-Check-for-HAVE__CPUIDEX-and-HAVE__GET_CPUID_COUN.patch | application/octet-stream | 6.2 KB |
| v11-0003-pg_test_timing-Reduce-per-loop-overhead.patch | application/octet-stream | 3.6 KB |
| v11-0004-instrumentation-Streamline-ticks-to-nanosecond-c.patch | application/octet-stream | 11.2 KB |
| v11-0005-instrumentation-Use-Time-Stamp-Counter-TSC-on-x8.patch | application/octet-stream | 38.0 KB |
| v11-0006-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and.patch | application/octet-stream | 6.2 KB |
| v11-0007-instrumentation-ARM-support-for-fast-time-measur.patch | application/octet-stream | 8.4 KB |