Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Lukas Fittl <lukas(at)fittl(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2022-07-01 17:26:39
Message-ID: 20220701172639.ty4iu5almspueriu@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
> On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > The attached patches are really just a prototype. I'm also not really
> > planning to work on getting this into a "production ready" patchset
> > anytime soon. I developed it primarily because I found it the overhead
> > made it too hard to nail down in which part of a query tree performance
> > changed. If somebody else wants to continue from here...
> >
> > I do think it's be a pretty significant improvement if we could reduce
> > the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
> > bunch of low-level code.
> >
>
> Based on an off-list conversation with Andres, I decided to dust off this
> old
> patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
> improvements (especially when using rdtsc instead of rdtsc*p*) seem to
> warrant
> giving this a more thorough look.
>
> See attached an updated patch (adding it to the July commitfest), with a few
> changes:
>
> - Keep using clock_gettime() as a fallback if we decide to not use rdtsc

Yep.

> - Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work

I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.

> - In an abundance of caution, for now I've decided to only enable this if we
> are on Linux/x86, and the current kernel clocksource is TSC (the kernel
> has
> quite sophisticated logic around making this decision, see [1])

I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...

> Note that if we implemented the decision logic ourselves (instead of relying
> on the Linux kernel), I'd be most worried about older virtualization
> technology. In my understanding getting this right is notably more
> complicated
> than just checking cpuid, see [2].

> Known WIP problems with this patch version:
>
> * There appears to be a timing discrepancy I haven't yet worked out, where
> the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
> reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
> higher for \timing than for the EXPLAIN ANALYZE time reported on the
> server
> side, only when rdtsc measurement is used -- its likely there is a problem
> somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

> * Possibly related, the floating point handling for the cycles_to_sec
> variable
> is problematic in terms of precision (see FIXME, taken over from Andres'
> POC)

And probably also performance...

> Open questions from me:
>
> 1) Do we need to account for different TSC offsets on different CPUs in SMP
> systems? (the Linux kernel certainly has logic to that extent, but [3]
> suggests this is no longer a problem on Nehalem and newer chips, i.e.
> those
> having an invariant TSC)

I don't think we should cater to systems where we need that.

> 2) Should we have a setting "--with-tsc" for configure? (instead of always
> enabling it when on Linux/x86 with a TSC clocksource)

Probably not worth it.

> 3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
> current instructions to finish -- the prior discussion seemed to suggest
> we don't want it for node instruction measurements, but possibly we do
> want
> this in other cases?)

I was wondering about that too... Perhaps we should add a
INSTR_TIME_SET_CURRENT_BARRIER() or such?

> 4) Should we support using the "mrs" instruction on ARM? (which is similar
> to
> rdtsc, see [4])

I'd leave that for later personally.

> #define NS_PER_S INT64CONST(1000000000)
> #define US_PER_S INT64CONST(1000000)
> #define MS_PER_S INT64CONST(1000)
> @@ -95,17 +104,37 @@ typedef int64 instr_time;
>
> #define INSTR_TIME_SET_ZERO(t) ((t) = 0)
>
> -static inline instr_time pg_clock_gettime_ns(void)
> +extern double cycles_to_sec;
> +
> +bool use_rdtsc;

This should be extern and inside the ifdef below.

> +#if defined(__x86_64__) && defined(__linux__)
> +extern void pg_clock_gettime_initialize_rdtsc(void);
> +#endif
> +
> +static inline instr_time pg_clock_gettime_ref_cycles(void)
> {
> struct timespec tmp;
>
> +#if defined(__x86_64__) && defined(__linux__)
> + if (use_rdtsc)
> + return __rdtsc();
> +#endif
> +
> clock_gettime(PG_INSTR_CLOCK, &tmp);
>
> return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
> }
>

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-07-01 17:30:16 Re: EINTR in ftruncate()
Previous Message Mark Wong 2022-07-01 17:18:20 Re: real/float example for testlibpq3