| From: | Lukas Fittl <lukas(at)fittl(dot)com> |
|---|---|
| To: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |
| Date: | 2026-02-12 16:05:27 |
| Message-ID: | CAP53PkxAtD_ArzxaXPMgAKTfT5Da34o-oN9JY32UbhFS3U=-6A@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Feb 4, 2026 at 2:09 AM David Geier <geidav(dot)pg(at)gmail(dot)com> wrote:
> On 03.02.2026 18:44, Andres Freund wrote:
> > Whereas it'll not make sense for anything that needs wall clock times - which
> > imo makes a "clock_source" GUC misnamed. Maybe "clock_source_timing" or such?
>
> Makes sense. clock_source_timing works for me, or maybe easier to read
> would be timing_clock_source. But doesn't matter much.
I've gone with "timing_clock_source" for now, that seems best to me so far.
See attached v6, which has the following changes:
Adds a new commit (0002) that:
- Unifies the Windows QueryPerformanceCounter() implementation with
the TSC implementation, and introduces the pg_ticks_to_ns function and
the ticks_per_ns_scaled conversion ratio in this commit.
- I realized that we were unnecessarily calling
QueryPerformanceFrequency on Windows every time we got the ticks, and
this is the same problem we're solving for the TSC frequency.
- This also allows us to independently test if the overhead of
overflow handling of pg_ticks_to_ns is a problem when clock_gettime is
used (since we're always exceeding max_ticks_no_overflow -- see end of
email), and makes the code read much better, I think.
- The one downside is that this means every program that wants to use
INSTR_* macros now has to call pg_initialize_timing() first. In 0003
we can then rely on the GUC mechanism to do this in the backend, as
before.
For the main TSC commit (now 0003):
- Enable the use of TSC on Windows when set explicitly via the GUC,
since I couldn't find a good reason why not - but note I have not done
any manual testing on Windows yet.
- Refactoring to improve readability and better split TSC logic from
general timing clock source logic
In regards to the GUC (part of 0003):
- Renamed to "timing_clock_source", with three settings:
- "auto" that will use the TSC clock source if we're on Linux and
Linux itself uses the TSC clock source
- "system" will force use of the system APIs, i.e. clock_gettime()
or QueryPerformanceCounter() -- this was named "off" before, but I
think "system" is more clear
- "tsc" will force the use of RDTSC/RDTSCP on x86-64, and will fail
if it is not available. Not a possible setting on other architectures.
-- this was named "rdtsc" before, but I think "tsc" is better, since
we use a mix of RDTSC and RDTSCP
- Resolves the ubsan issue in CI, which was caused by a missing
addition to "config_group_names" for the new "Resource / Time" GUC
group. I wonder if we should make this "Resource / Other" instead
though? (it seems unlikely we'll have another GUC for time
specifically?)
- Implements a show hook for the GUC that will show the current value
in parenthesis with auto is selected. Is this sufficient to address
the use case of wanting to know the current clock source?
postgres=# SHOW timing_clock_source;
timing_clock_source
---------------------
auto (tsc)
For the pg_test_timing change (0004):
- If available, show both RDTSC and RDTSCP timings (RDTSC indicated as
"Fast"), as well as the system clock source, to help decide whether to
enable TSC.
FWIW, I have not yet investigated expanding the use of fast timing to
other places (e.g. track_io_timing/track_wal_io_timing) as suggested.
Regarding the overhead introduced with pg_ticks_to_ns:
On master (88327092ff0), I'm getting 23.54 ns from pg_test_timing - vs
with 0002 applied, this slows to 25.74 ns. I've tried to see if the
"unlikely(..)" we added in pg_ticks_to_ns is the problem (since in the
clock_gettime() case we'd always be running into that branch due to
the size of the nanoseconds value), but no luck - I think the extra
multiplication/division itself is the problem.
Any ideas how we could do this differently?
Thanks,
Lukas
--
Lukas Fittl
| Attachment | Content-Type | Size |
|---|---|---|
| v6-0001-Check-for-HAVE__CPUIDEX-and-HAVE__GET_CPUID_COUNT.patch | application/x-patch | 6.7 KB |
| v6-0002-Timing-Always-perform-ticks-to-nanosecond-convers.patch | application/x-patch | 11.1 KB |
| v6-0004-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and-.patch | application/x-patch | 7.1 KB |
| v6-0003-Timing-Use-Time-Stamp-Counter-TSC-on-x86-64-for-f.patch | application/x-patch | 24.4 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2026-02-12 16:17:37 | Our ABI diff infrastructure ignores enum SysCacheIdentifier |
| Previous Message | Tom Lane | 2026-02-12 15:53:46 | Re: pgsql: Add file_extend_method=posix_fallocate,write_zeros. |