Quick Links

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From:	Lukas Fittl <lukas(at)fittl(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com>, andrew(at)dunslane(dot)net, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>
Subject:	Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date:	2026-04-09 04:36:48
Message-ID:	CAP53PkwXBHQhpcW9k8PnUN+NZjVbmWwx7unehHxANBjD95a7Ow@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Apr 8, 2026 at 12:44 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > I half wonder if this could be a case of a Hypervisor, but not KVM or
> > VMware, and so we fall through to the regular CPUID information (which
> > AFAIR is different from how Linux itself handles that case, where
> > it'll always do calibration in such cases). I think the solution might
> > be to use the TSC calibration always on other hypervisors.
>
> Plausible.
>

And that is indeed the problem. Looks like we can't trust CPUID
0x15/0x16 when we're under a Hypervisor and its not KVM or VMware.

I was able to reproduce this on an m5.xlarge instance on AWS with
Windows Server 2019 installed, building with MSVC 2019:

TSC frequency in use: 7 kHz
TSC frequency source: x86, hypervisor (other), cpuid 0x15
TSC frequency from calibration: 2499879 kHz
TSC clock source will be used by default, unless timing_clock_source
is set to 'system'.

And when we return after we realize the Hypervisor can't give us the
frequency and use calibration instead:

TSC frequency in use: 2500357 kHz
TSC frequency source: x86, calibration
TSC frequency from calibration: 2499793 kHz
TSC clock source will be used by default, unless timing_clock_source
is set to 'system'.

It appears the problem is related to however AWS implements the
"nitro" virtualization layer for Windows. Interestingly enough, a
m5.xlarge instance with Linux works just fine, and shows a KVM
hypervisor. For Linux, it gets the correct information from the
special CPUID register for KVM/VMware, which matches the calibrated
TSC frequency shown under Windows.

FWIW, earlier deprecated EC2 instance types that used Xen (e.g.
m4.xlarge) just report "TSC unusable" on Windows, presumably because
its not an invariant TSC, but I haven't dug into it since it seems
fine to automatically use the system clock source in that case.

> > > I wonder if we also should add a pg_timing_clock_source_info() function that
> > > returns frequency_khz, calibrated_frequency_khz, frequency_source_info or
> > > such?
> >
> > See attached a patch that adds that and shows its output in
> > pg_test_timing. Here is an example from an AWS instance:
> >
> > TSC frequency in use: 2899943 kHz
> > TSC frequency source: x86, hypervisor (kvm), cpuid 0x40000010
> > TSC frequency from calibration: 2899063 kHz
> > TSC clock source will be used by default, unless timing_clock_source
> > is set to 'system'.
> >
> > And from Azure (HyperV):
> >
> > TSC frequency in use: 2791936 kHz
> > TSC frequency source: x86, calibration
> > TSC frequency from calibration: 2793379 kHz
> > TSC clock source will be used by default, unless timing_clock_source
> > is set to 'system'.
>
> Nice.
>

FWIW, I also ran tests today on VMware (AMD CPU):

TSC frequency in use: 2994329 kHz
TSC frequency source: x86, hypervisor (vmware), cpuid 0x40000010
TSC frequency from calibration: 2994008 kHz
TSC clock source will be used by default, unless timing_clock_source
is set to 'system'.

And Virtualbox (Intel CPU, thanks Maciek):

TSC frequency in use: 2989463 kHz
TSC frequency source: x86, calibration
TSC frequency from calibration: 2989619 kHz
TSC clock source will be used by default, unless timing_clock_source
is set to 'system'.

This was without the early return, i.e. Virtualbox doesn't pass
through CPUID even if the host has it on Intel CPUs.

> What do you think about making pg_test_timing warn and return 1 if there is a
> tsc clocksource but the calibrated frequency differs by more than, idk, 10%?
> I'm worried that there might be other problems like this lurking and we
> wouldn't know about them unless the issue is of a similar magnitude.

Yeah, that seems like a good idea. If I understand correctly you're
thinking we could tell the user to switch to
timing_clock_source=system in that case? (i.e. this is only a
pg_test_timing notice, not something "smarter" in the backend itself)

> >
> > #if PG_INSTR_TSC_CLOCK
> > +static const char *timing_tsc_frequency_source = NULL;
>
> Think this ought to document what the lifetime of the memory is.

Per your point below, I agree it seems best to just use a fixed buffer.

> > /*
> > * Initialize the TSC clock source by determining its usability and frequency.
> > @@ -202,13 +205,14 @@ static void
> > tsc_detect_frequency(void)
> > {
> > timing_tsc_frequency_khz = 0;
> > + timing_tsc_frequency_source = NULL;
> >
> > /* We require RDTSCP support and an invariant TSC, bail if not available */
> > if (!x86_feature_available(PG_RDTSCP) || !x86_feature_available(PG_TSC_INVARIANT))
> > return;
> >
> > /* Determine speed at which the TSC advances */
> > - timing_tsc_frequency_khz = x86_tsc_frequency_khz();
> > + timing_tsc_frequency_khz = x86_tsc_frequency_khz(&timing_tsc_frequency_source);
> > if (timing_tsc_frequency_khz > 0)
> > return;
>
> So, if we were called again, we'd leak the old content of the memory.
>
> But what I think is more problematic is that x86_tsc_frequency_khz() allocates
> memory with palloc(), via psprintf(), but there's no guarantee that it's
> allocated in a sufficiently long lived allocation.
>
> timing_tsc_frequency_source were a char[128] and you passed the string and
> length to x86_tsc_frequency_khz and then used snprintf(). That would also
> make it a bit easier to iteratively populate the string.

Ack, good point.

> > +/*
> > + * Returns TSC clock source information for diagnostic purposes.
> > + *
> > + * Note: This always runs the TSC calibration loop which may take up to
> > + * TSC_CALIBRATION_MAX_NS.
> > + */
> > +TscClockSourceInfo
> > +pg_timing_tsc_clock_source_info(void)
> > +{
> > + TscClockSourceInfo info;
> > +
> > + info.frequency_khz = timing_tsc_frequency_khz;
> > + info.frequency_source = timing_tsc_frequency_source;
> > + info.calibrated_frequency_khz = pg_tsc_calibrate_frequency();
>
> Seems we should also store the calibration somewhere and only run it if it
> hadn't already done?
>
> What if we just made this TscClockSourceInfo a struct in the file, populated
> it during normal initialization, and then returned it as a const
> TscClockSourceInfo *from pg_timing_tsc_clock_source_info?

Yeah, that seems reasonable.

Attached 0001 fixes the issue for me on my test instance, and
presumably will fix drongo as well.

0002 is the updated version of emitting the additional debug info. I
think this is certainly less critical to have in 19 now, but could
still be useful if there are any future oddities.

Thanks,
Lukas

--
Lukas Fittl

Attachment	Content-Type	Size
v25-0001-instrumentation-Avoid-CPUID-0x15-0x16-for-Hyperv.patch	application/octet-stream	2.0 KB
v25-0002-instrumentation-Show-additional-TSC-clock-source.patch	application/octet-stream	8.6 KB

In response to

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? at 2026-04-08 19:44:02 from Andres Freund

Responses

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? at 2026-04-09 16:02:28 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2026-04-09 04:54:05	Re: Adding REPACK [concurrently]
Previous Message	Peter Smith	2026-04-09 04:27:32	Add missing period to DETAIL messages