Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2026-01-31 20:11:33
Message-ID: CAP53PkyooCeR8YV0BUD_xC7oTZESHz8OdA=tP7pBRHFVQ9xtKg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 11, 2026 at 11:26 AM David Geier <geidav(dot)pg(at)gmail(dot)com> wrote:
>
> > Based on Robert's suggestion I wanted to add a "fast_clock_source" enum
> > GUC which can have the following values "auto", "rdtsc", "try_rdtsc" and
> > "off". With that, at least no additional checks are needed and
> > performance will remain as previously benchmarked in this thread.
>
> The attached patch set is rebased on latest master and contains a commit
> which adds a "fast_clock_source" GUC that can be "try", "off" and
> "rdtsc" on Linux.
>
> Alternatively, we could call the GUC "clock_source" with "auto",
> "clock_gettime" and "rdtsc". Opinions?

No strong opinion on the GUC name ("fast_clock_source" seems fine?),
but I think "try" is a bit confusing if our logic is more than just
checking if the RDTSC(P) instruction is available, so I'd be in favor
of "auto" as the default value.

> I moved the call to INSTR_TIME_INITIALIZE() from InitPostgres() to
> PostmasterMain(). In InitPostgres() it kept the database in a recovery
> cycle.

I think we can actually avoid having anything in PostmasterMain (or
InitPostgres), and instead rely on the GUC assign mechanism.

I've reworked the patch a bit more, see attached v4, with a couple of
notable changes:

Regarding the GUC:
- Use the GUC check mechanism to complain if the RDTSC clock source is
requested, but it's not available
- Use the GUC assign mechanism to set whether we're actually using the
RDTSC clock source
- "auto" now means that we use the RDTSC clock source by default if we're
on Linux x86, and the system clocksource is "tsc"
- "rdtsc" now allows using RDTSC on any x86-based Unix-like system (I
see no reason to restrict the BSDs from using RDTSC when setting it
explicitly)
- Allow changing the clock source GUC at any time, without requiring a
restart (it makes testing much easier, and I don't see a good reason
to require a restart, or even restrict this to superuser?)
- Have pg_test_timing emit whether a fast clock source will be used by
default (or whether one needs to change the GUC)
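
For illustration only, the "auto" rule in the list above (use RDTSC when
the Linux system clocksource is "tsc") could be sketched as below. This is
not the code from the attached patch: both helper names are hypothetical,
the x86 check is left out, and only the sysfs path is an existing Linux
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Linux sysfs file that names the kernel's currently active clocksource. */
#define CLOCKSOURCE_PATH \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/*
 * Hypothetical helper: returns true if the file contents name exactly the
 * "tsc" clocksource. A trailing newline, as written by sysfs, is ignored.
 */
static bool
clocksource_name_is_tsc(const char *contents)
{
    size_t      len;

    assert(contents != NULL);
    len = strcspn(contents, "\n");
    return len == 3 && strncmp(contents, "tsc", 3) == 0;
}

/*
 * Hypothetical helper: what "auto" might resolve to on this machine.
 * Returns false when the file cannot be read (e.g. on non-Linux systems).
 */
static bool
auto_would_use_rdtsc(void)
{
    char        buf[64];
    bool        result = false;
    FILE       *fp = fopen(CLOCKSOURCE_PATH, "r");

    if (fp == NULL)
        return false;
    if (fgets(buf, sizeof(buf), fp) != NULL)
        result = clocksource_name_is_tsc(buf);
    fclose(fp);
    return result;
}
```

Keeping the string comparison separate from the file read makes the rule
easy to test without depending on the host's actual clocksource.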

Additionally:
- If a client program wants to use the fast clock source (like
pg_test_timing does), it first needs to call
pg_initialize_fast_clock_source() -- this replaces the
INSTR_TIME_INITIALIZE calls.
- I've re-introduced a patch (0001) to set HAVE__CPUIDEX on modern
GCC/clang. That's necessary to make this work under VM hypervisors (per
the patch's commit message)
- I've merged the GUC patch with the patch that adds the RDTSC
implementation (0002); I don't think it makes sense to review or
commit them separately.
- I've unified the RDTSC and RDTSCP handling, so we require both in
order to use TSC as a time source. Because we have the shared
pg_ticks_to_ns() function that gets used on an instr_time regardless
of fast vs "slow" timing, and the variables used in that function are
affected by the RDTSC availability, we must use TSC consistently - I
don't think we can mix RDTSC for fast and pg_clock_gettime() for slow,
as this patch series has done so far.
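
To make the last point concrete, here is a minimal sketch of a shared
ticks-to-nanoseconds conversion whose scaling state depends on the active
clock source. It is an assumption about the general shape, not the
patch's pg_ticks_to_ns(): the names, the shift width and the
fixed-point scheme are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point precision: the multiplier is scaled by 2^14. */
#define TICKS_SHIFT 14

/*
 * Shared conversion state. The default is the identity mapping, for a
 * clock source (such as clock_gettime) whose "ticks" are already ns.
 */
static uint64_t ticks_mult = UINT64_C(1) << TICKS_SHIFT;

/* Select the TSC: one tick is 1e9 / freq_hz nanoseconds. */
static void
set_tick_frequency(uint64_t freq_hz)
{
    assert(freq_hz > 0);
    ticks_mult = (UINT64_C(1000000000) << TICKS_SHIFT) / freq_hz;
}

/* Select a nanosecond-based clock: ticks map 1:1 to nanoseconds. */
static void
use_nanosecond_ticks(void)
{
    ticks_mult = UINT64_C(1) << TICKS_SHIFT;
}

/*
 * Convert ticks to nanoseconds as floor(ticks * ticks_mult / 2^TICKS_SHIFT).
 * Splitting ticks into high and low parts avoids overflowing the 64-bit
 * intermediate product for large tick counts.
 */
static uint64_t
ticks_to_ns(uint64_t ticks)
{
    uint64_t    hi = ticks >> TICKS_SHIFT;
    uint64_t    lo = ticks & ((UINT64_C(1) << TICKS_SHIFT) - 1);

    return hi * ticks_mult + ((lo * ticks_mult) >> TICKS_SHIFT);
}
```

Because every instr_time goes through the same function, a value read
from one clock source but converted with the other source's multiplier
would be scaled wrongly, which is why mixing the two is a problem. The
identity path also still pays for the multiply and shifts, which is the
kind of cost raised in the open questions below.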

Open questions for me:
- I'm seeing a CI test failure for "Linux - Debian Trixie - Meson"
(times out), but it's not clear if this is a fluke - I'll check if this
recurs on the commitfest patch
- We're doing a lot of work in pg_ticks_to_ns, even when we're not
using RDTSC - and I think that shows in a slightly slower
pg_test_timing measurement compared to master when fast clock source
is off. Can we somehow only do that when we use RDTSC?

Here is a fresh test run with this patch on an AWS c6i.xlarge, i.e.
Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz / "Ice Lake":

CREATE TABLE test (id int);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);

postgres=# SET fast_clock_source = off;
SET
Time: 0.107 ms
postgres=# EXPLAIN ANALYZE SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=10633.55..10633.56 rows=1 width=8) (actual time=44.117..44.811 rows=1.00 loops=1)
  Buffers: shared hit=846 read=3579
  -> Gather (cost=10633.34..10633.55 rows=2 width=8) (actual time=44.060..44.804 rows=3.00 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=846 read=3579
        -> Partial Aggregate (cost=9633.34..9633.35 rows=1 width=8) (actual time=42.129..42.130 rows=1.00 loops=3)
              Buffers: shared hit=846 read=3579
              -> Parallel Seq Scan on test (cost=0.00..8591.67 rows=416667 width=0) (actual time=0.086..21.595 rows=333333.67 loops=3)
                    Buffers: shared hit=846 read=3579
Planning Time: 0.043 ms
Execution Time: 44.836 ms
(12 rows)

Time: 45.076 ms

postgres=# SET fast_clock_source = rdtsc;
SET
Time: 0.123 ms
postgres=# EXPLAIN ANALYZE SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=10633.55..10633.56 rows=1 width=8) (actual time=32.943..33.912 rows=1.00 loops=1)
  Buffers: shared hit=1128 read=3297
  -> Gather (cost=10633.34..10633.55 rows=2 width=8) (actual time=32.868..33.906 rows=3.00 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=1128 read=3297
        -> Partial Aggregate (cost=9633.34..9633.35 rows=1 width=8) (actual time=30.705..30.706 rows=1.00 loops=3)
              Buffers: shared hit=1128 read=3297
              -> Parallel Seq Scan on test (cost=0.00..8591.67 rows=416667 width=0) (actual time=0.080..15.223 rows=333333.67 loops=3)
                    Buffers: shared hit=1128 read=3297
Planning Time: 0.042 ms
Execution Time: 33.935 ms
(12 rows)

Time: 34.180 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM test;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=10633.55..10633.56 rows=1 width=8) (actual rows=1.00 loops=1)
  Buffers: shared hit=1410 read=3015
  -> Gather (cost=10633.34..10633.55 rows=2 width=8) (actual rows=3.00 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=1410 read=3015
        -> Partial Aggregate (cost=9633.34..9633.35 rows=1 width=8) (actual rows=1.00 loops=3)
              Buffers: shared hit=1410 read=3015
              -> Parallel Seq Scan on test (cost=0.00..8591.67 rows=416667 width=0) (actual rows=333333.67 loops=3)
                    Buffers: shared hit=1410 read=3015
Planning Time: 0.042 ms
Execution Time: 27.876 ms
(12 rows)

Time: 28.135 ms
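
As a rough reading of the three runs above, subtracting the TIMING OFF
execution time from each timed run gives the overhead attributable to
timing: about 16.960 ms with fast_clock_source = off versus about
6.059 ms with rdtsc, roughly 64% less. The sketch below only redoes that
subtraction; the helper and constant names are hypothetical. These are
single runs, and the shared-buffer hit/read counts differ between them,
so the figures are indicative only.

```c
#include <assert.h>

/* "Execution Time" values copied from the three runs above, in ms. */
static const double EXEC_CLOCK_OFF_MS = 44.836;   /* fast_clock_source = off */
static const double EXEC_RDTSC_MS = 33.935;       /* fast_clock_source = rdtsc */
static const double EXEC_TIMING_OFF_MS = 27.876;  /* EXPLAIN (ANALYZE, TIMING OFF) */

/*
 * Hypothetical helper: extra runtime of a timed run compared to the same
 * query executed with TIMING OFF.
 */
static double
timing_overhead_ms(double with_timing_ms, double timing_off_ms)
{
    assert(with_timing_ms >= 0.0 && timing_off_ms >= 0.0);
    return with_timing_ms - timing_off_ms;
}
```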

Thanks,
Lukas

--
Lukas Fittl

Attachment                                                       Content-Type              Size
v4-0002-Use-time-stamp-counter-to-measure-time-on-Linux-x.patch  application/octet-stream  20.6 KB
v4-0003-pg_test_timing-Also-test-fast-timing-and-report-t.patch  application/octet-stream  8.2 KB
v4-0001-Check-for-HAVE__CPUIDEX-and-HAVE__GET_CPUID_COUNT.patch  application/octet-stream  6.6 KB
