| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Lukas Fittl <lukas(at)fittl(dot)com> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com>, andrew(at)dunslane(dot)net |
| Subject: | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |
| Date: | 2026-04-08 19:44:02 |
| Message-ID: | 26pttea74dz5gvsplaa4qsrwuytrza6g2mieqb7wxi4nnwb3cp@7uhq7zkfm6me |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
On 2026-04-08 12:25:54 -0700, Lukas Fittl wrote:
> > # TSC frequency in use: 7 kHz
> > # TSC frequency from calibration: 2500044 kHz
> > # TSC clock source will be used by default, unless timing_clock_source is set to 'system'.
> >
> > Sooo, this system claims to have an invariant tsc but the frequency
> > we are getting from cpuid is completely out of whack.
>
> Huh. Yeah, I think this is a case of getting a bad TSC frequency from CPUID.
Yea. Not sure yet why...
> I half wonder if this could be a case of a Hypervisor, but not KVM or
> VMware, and so we fall through to the regular CPUID information (which
> AFAIR is different from how Linux itself handles that case, where
> it'll always do calibration in such cases). I think the solution might
> be to use the TSC calibration always on other hypervisors.
Plausible.
> But lets wait for Andrew to confirm the configuration of the machine /
> have a run with the additional information.
Yep.
> > I wonder if we also should add a pg_timing_clock_source_info() function that
> > returns frequency_khz, calibrated_frequency_khz, frequency_source_info or
> > such?
>
> See attached a patch that adds that and shows its output in
> pg_test_timing. Here is an example from an AWS instance:
>
> TSC frequency in use: 2899943 kHz
> TSC frequency source: x86, hypervisor (kvm), cpuid 0x40000010
> TSC frequency from calibration: 2899063 kHz
> TSC clock source will be used by default, unless timing_clock_source
> is set to 'system'.
>
> And from Azure (HyperV):
>
> TSC frequency in use: 2791936 kHz
> TSC frequency source: x86, calibration
> TSC frequency from calibration: 2793379 kHz
> TSC clock source will be used by default, unless timing_clock_source
> is set to 'system'.
Nice.
> Note it doesn't emit the fact that its a hypervisor if calibration was
> used to keep the code a bit simpler, but would emit it as "hypervisor
> (other)" or "hypervisor (unknown)" (if cpuidex wasn't available) if
> cpuid 0x15/0x16 get used.
That seems ok for now, I think.
What do you think about making pg_test_timing warn and return 1 if there is a
tsc clocksource but the calibrated frequency differs by more than, idk, 10%?
I'm worried that there might be other problems like this lurking and we
wouldn't know about them unless the issue is of a similar magnitude.
>
> #if PG_INSTR_TSC_CLOCK
> +static const char *timing_tsc_frequency_source = NULL;
Think this ought to document what the lifetime of the memory is.
> /*
> * Initialize the TSC clock source by determining its usability and frequency.
> @@ -202,13 +205,14 @@ static void
> tsc_detect_frequency(void)
> {
> timing_tsc_frequency_khz = 0;
> + timing_tsc_frequency_source = NULL;
>
> /* We require RDTSCP support and an invariant TSC, bail if not available */
> if (!x86_feature_available(PG_RDTSCP) || !x86_feature_available(PG_TSC_INVARIANT))
> return;
>
> /* Determine speed at which the TSC advances */
> - timing_tsc_frequency_khz = x86_tsc_frequency_khz();
> + timing_tsc_frequency_khz = x86_tsc_frequency_khz(&timing_tsc_frequency_source);
> if (timing_tsc_frequency_khz > 0)
> return;
So, if we were called again, we'd leak the old content of the memory.
But what I think is more problematic is that x86_tsc_frequency_khz() allocates
memory with palloc(), via psprintf(), but there's no guarantee that it's
allocated in a sufficiently long lived allocation.
timing_tsc_frequency_source were a char[128] and you passed the string and
length to x86_tsc_frequency_khz and then used snprintf(). That would also
make it a bit easier to iteratively populate the string.
> +/*
> + * Returns TSC clock source information for diagnostic purposes.
> + *
> + * Note: This always runs the TSC calibration loop which may take up to
> + * TSC_CALIBRATION_MAX_NS.
> + */
> +TscClockSourceInfo
> +pg_timing_tsc_clock_source_info(void)
> +{
> + TscClockSourceInfo info;
> +
> + info.frequency_khz = timing_tsc_frequency_khz;
> + info.frequency_source = timing_tsc_frequency_source;
> + info.calibrated_frequency_khz = pg_tsc_calibrate_frequency();
Seems we should also store the calibration somewhere and only run it if it
hadn't already done?
What if we just made this TscClockSourceInfo a struct in the file, populated
it during normal initialization, and then returned it as a const
TscClockSourceInfo *from pg_timing_tsc_clock_source_info?
Greetings,
Andres Freund
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Haibo Yan | 2026-04-08 19:50:50 | Re: Extract numeric filed in JSONB more effectively |
| Previous Message | Sami Imseih | 2026-04-08 19:35:29 | Re: Add pg_stat_autovacuum_priority |