| From: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
|---|---|
| To: | Lukas Fittl <lukas(at)fittl(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Hannu Krosing <hannuk(at)google(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |
| Date: | 2026-02-04 10:02:42 |
| Message-ID: | adbdd9b2-c012-496c-9f09-b11f81e7ec17@gmail.com |
| Lists: | pgsql-hackers |
Hi Lukas!
On 31.01.2026 21:11, Lukas Fittl wrote:
> On Sun, Jan 11, 2026 at 11:26 AM David Geier <geidav(dot)pg(at)gmail(dot)com> wrote:
>>
>>> Based on Robert's suggestion I wanted to add a "fast_clock_source" enum
>>> GUC which can have the following values "auto", "rdtsc", "try_rdtsc" and
>>> "off". With that, at least no additional checks are needed and
>>> performance will remain as previously benchmarked in this thread.
>>
>> The attached patch set is rebased on latest master and contains a commit
>> which adds a "fast_clock_source" GUC that can be "try", "off" and
>> "rdtsc" on Linux.
>>
>> Alternatively, we could call the GUC "clock_source" with "auto",
>> "clock_gettime" and "rdtsc". Opinions?
>
> No strong opinion on the GUC name ("fast_clock_source" seems fine?),
> but I think "try" is a bit confusing if our logic is more than just
> checking if the RDTSC(P) instruction is available, so I'd be in favor
> of "auto" as the default value.
>
>> I moved the call to INSTR_TIME_INITIALIZE() from InitPostgres() to
>> PostmasterMain(). In InitPostgres() it kept the database in a recovery
>> cycle.
>
> I think we can actually avoid having anything in PostmasterMain (or
> InitPostgres), and instead rely on the GUC assign mechanism.
>
Good idea. I hadn't realized that check and assign hooks are always called
during PostgreSQL startup.
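For the archives, here's roughly how I picture the check/assign split (hook
signatures as in guc.h; the helper and variable names are made up, not
necessarily what v4 uses):

    static bool
    check_fast_clock_source(int *newval, void **extra, GucSource source)
    {
        /* Complain if RDTSC is explicitly requested but unusable here. */
        if (*newval == FAST_CLOCK_SOURCE_RDTSC && !pg_rdtsc_available())
        {
            GUC_check_errdetail("RDTSC is not available on this platform.");
            return false;
        }
        return true;
    }

    static void
    assign_fast_clock_source(int newval, void *extra)
    {
        /* Record whether instr_time should actually read the TSC. */
        use_rdtsc_clock = (newval == FAST_CLOCK_SOURCE_RDTSC ||
                           (newval == FAST_CLOCK_SOURCE_AUTO &&
                            pg_rdtsc_is_default()));
    }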
> I've reworked the patch a bit more, see attached v4, with a couple of
> noticeable changes:
Great. Thanks!
> In regards to the GUC:
> - Use the GUC check mechanism to complain if RDTSC clock source is
> requested, but it's not available
> - Use the GUC assign mechanism to set whether we're actually using the
> RDTSC clock source
Nice!
> - "auto" now means that we use RDTSC clock source by default if we're
> on Linux x86, and the system clocksource is "tsc"
Not that I care much, but I picked "try" for consistency with "try" in the
huge_pages GUC. What's your motivation behind "auto"?
We still need to add the new GUC to the documentation. We should mention
that RDTSC can yield subpar performance in some environments (e.g. with an
emulated TSC).
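For "auto" the docs could also spell out the Linux-specific part of the
decision: whether the kernel itself uses TSC as its clocksource. The check I
have in mind is essentially reading the standard sysfs file; a standalone
sketch (function name made up):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* True if the kernel's current clocksource is "tsc" (Linux only). */
    static bool
    system_clocksource_is_tsc(void)
    {
        const char *path =
            "/sys/devices/system/clocksource/clocksource0/current_clocksource";
        FILE       *f = fopen(path, "r");
        char        buf[64];
        bool        result = false;

        if (f == NULL)
            return false;
        if (fgets(buf, sizeof(buf), f) != NULL)
            result = (strncmp(buf, "tsc", 3) == 0);
        fclose(f);
        return result;
    }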
> - "rdtsc" now allows using RDTSC on any x86-based Unix-like systems (I
> see no reason to restrict the BSDs from using RDTSC when setting it
> explicitly)
Why not then also Windows, or just any x86-64-based operating system? In
reality it's a CPU feature, not an OS feature.
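Since it's a CPU capability, the authoritative check is CPUID leaf
0x80000007, EDX bit 8 ("invariant TSC"). Roughly like this on GCC/clang
(just an illustration, not the patch's code):

    #include <cpuid.h>
    #include <stdbool.h>

    /* True if the CPU advertises an invariant (constant-rate) TSC. */
    static bool
    cpu_has_invariant_tsc(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) == 0)
            return false;       /* extended leaf not supported */
        return (edx & (1 << 8)) != 0;
    }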
> - Allow changing the clock source GUC at any time, without requiring a
> restart (it makes testing much easier, and I don't see a good reason
> to require a restart, or even restrict this to superuser?)
Not requiring a restart makes sense.
I'm not sure about allowing any user to set it, though. I thought we keep
system-configuration GUCs restricted to the superuser, because the admin
knows best whether RDTSC is actually faster or not.
> - Have pg_test_timing emit whether a fast clock source will be used by
> default (or whether one needs to change the GUC)
That's useful.
> Additionally:
> - If a client program wants to use the fast clock source (like
> pg_test_timing does), it first needs to call
> pg_initialize_fast_clock_source() -- this replaces the
> INSTR_TIME_INITIALIZE calls.
> - I've re-introduced a patch (0001) to set HAVE__CPUIDEX on modern
> GCC/clang. That's necessary to make this work on VM Hypervisors (per
> the patch's commit message)
Thanks. I had taken this out accidentally when rebasing.
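For readers following along: __cpuidex() is the MSVC intrinsic that lets you
pass a CPUID subleaf in ECX; on GCC/clang the equivalent lives in <cpuid.h>.
A sketch of the kind of portability shim this enables (illustrative only,
not the patch's code):

    #if defined(_MSC_VER)
    #include <intrin.h>
    #else
    #include <cpuid.h>
    #endif

    /* Run CPUID with an explicit subleaf (ECX), portably across compilers. */
    static void
    pg_cpuid_count(int leaf, int subleaf, int regs[4])
    {
    #if defined(_MSC_VER) || defined(HAVE__CPUIDEX)
        __cpuidex(regs, leaf, subleaf);
    #else
        __cpuid_count(leaf, subleaf, regs[0], regs[1], regs[2], regs[3]);
    #endif
    }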
> - I've merged the GUC patch together with the patch that adds the
> RDTSC implementation (0002), I don't think that makes sense to review
> or commit separately.
Makes sense. I had only separated it last time to make the changes easier to
see.
> - I've unified the RDTSC and RDTSCP handling, so we require both in
> order to use TSC as a time source. Because we have the shared
> pg_ticks_to_ns() function that gets used on an instr_time regardless
> of fast vs "slow" timing, and the variables used in that function are
> affected by the RDTSC availability, we must use TSC consistently - I
> don't think we can mix RDTSC for fast and pg_clock_gettime() for slow,
> as this patch series has done so far.
Makes sense.
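To make the constraint explicit for other readers: RDTSC and RDTSCP read the
same counter, RDTSCP merely waits for preceding instructions to retire. So
once pg_ticks_to_ns() assumes TSC units, every reader must produce TSC ticks.
A minimal sketch of the two reads, assuming <x86intrin.h> on GCC/clang (not
the patch's exact code):

    #include <stdint.h>
    #include <x86intrin.h>

    static inline uint64_t
    read_tsc(void)
    {
        return __rdtsc();       /* plain read, may be reordered by the CPU */
    }

    static inline uint64_t
    read_tsc_serialized(void)
    {
        unsigned int aux;

        return __rdtscp(&aux);  /* waits for prior instructions to retire */
    }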
> Open questions for me:
> - I'm seeing a CI test failure for "Linux - Debian Trixie - Meson"
> (times out), but it's not clear if this is a fluke - I'll check if this
> recurs on the commitfest patch
> - We're doing a lot of work in pg_ticks_to_ns, even when we're not
> using RDTSC - and I think that shows in a slightly slower
> pg_test_timing measurement compared to master when fast clock source
> is off. Can we somehow only do that when we use RDTSC?
The only improvement I can see is to add another branch that returns the
ticks directly when the default clock source is used. That would save a
useless addition, multiplication and shift. The overflow check shouldn't
matter much: it should never be taken and hence be perfectly
branch-predicted, given that
INSTR_TIME_TICKS_TO_NANOSEC(max_ticks_no_overflow) == 562949953421311 ==
~6.5 days.
We can test how that compares performance-wise. But I'm leaning towards
keeping it as is.
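To illustrate what I mean (a sketch with made-up variable names, assuming a
precomputed multiplier/shift conversion along the lines of what the patch
already does):

    static inline uint64_t
    pg_ticks_to_ns(uint64_t ticks)
    {
        /* Proposed extra branch: with the default clock, ticks are already ns. */
        if (!use_rdtsc_clock)
            return ticks;

        /* Rare slow path, only reachable after ~6.5 days worth of ticks. */
        if (unlikely(ticks > max_ticks_no_overflow))
            return (ticks / tsc_freq_hz) * 1000000000 +
                   ((ticks % tsc_freq_hz) * 1000000000) / tsc_freq_hz;

        /* Common fixed-point path: one multiplication and one shift. */
        return (ticks * ticks_to_ns_mult) >> ticks_to_ns_shift;
    }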
> Here is a fresh test run with this patch on an AWS c6i.xlarge, i.e.
> Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz / "Ice Lake":
>
> CREATE TABLE test (id int);
> INSERT INTO test SELECT * FROM generate_series(0, 1000000);
>
> postgres=# SET fast_clock_source = off;
> SET
> Time: 0.107 ms
> postgres=# EXPLAIN ANALYZE SELECT COUNT(*) FROM test;
>                                                                  QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8) (actual time=44.117..44.811 rows=1.00 loops=1)
>    Buffers: shared hit=846 read=3579
>    ->  Gather  (cost=10633.34..10633.55 rows=2 width=8) (actual time=44.060..44.804 rows=3.00 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          Buffers: shared hit=846 read=3579
>          ->  Partial Aggregate  (cost=9633.34..9633.35 rows=1 width=8) (actual time=42.129..42.130 rows=1.00 loops=3)
>                Buffers: shared hit=846 read=3579
>                ->  Parallel Seq Scan on test  (cost=0.00..8591.67 rows=416667 width=0) (actual time=0.086..21.595 rows=333333.67 loops=3)
>                      Buffers: shared hit=846 read=3579
>  Planning Time: 0.043 ms
>  Execution Time: 44.836 ms
> (12 rows)
>
> Time: 45.076 ms
>
> postgres=# SET fast_clock_source = rdtsc;
> SET
> Time: 0.123 ms
> postgres=# EXPLAIN ANALYZE SELECT COUNT(*) FROM test;
>                                                                  QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8) (actual time=32.943..33.912 rows=1.00 loops=1)
>    Buffers: shared hit=1128 read=3297
>    ->  Gather  (cost=10633.34..10633.55 rows=2 width=8) (actual time=32.868..33.906 rows=3.00 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          Buffers: shared hit=1128 read=3297
>          ->  Partial Aggregate  (cost=9633.34..9633.35 rows=1 width=8) (actual time=30.705..30.706 rows=1.00 loops=3)
>                Buffers: shared hit=1128 read=3297
>                ->  Parallel Seq Scan on test  (cost=0.00..8591.67 rows=416667 width=0) (actual time=0.080..15.223 rows=333333.67 loops=3)
>                      Buffers: shared hit=1128 read=3297
>  Planning Time: 0.042 ms
>  Execution Time: 33.935 ms
> (12 rows)
>
> Time: 34.180 ms
>
> postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM test;
>                                                        QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8) (actual rows=1.00 loops=1)
>    Buffers: shared hit=1410 read=3015
>    ->  Gather  (cost=10633.34..10633.55 rows=2 width=8) (actual rows=3.00 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          Buffers: shared hit=1410 read=3015
>          ->  Partial Aggregate  (cost=9633.34..9633.35 rows=1 width=8) (actual rows=1.00 loops=3)
>                Buffers: shared hit=1410 read=3015
>                ->  Parallel Seq Scan on test  (cost=0.00..8591.67 rows=416667 width=0) (actual rows=333333.67 loops=3)
>                      Buffers: shared hit=1410 read=3015
>  Planning Time: 0.042 ms
>  Execution Time: 27.876 ms
> (12 rows)
>
> Time: 28.135 ms
That's a nice speedup.
The gain will probably show up even more clearly with
max_parallel_workers_per_gather = 0, because in your example forking, plan
serialization etc. add significant overhead relative to the total runtime.
--
David Geier