Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2021-03-13 00:34:16
Message-ID: CA+hUKGJwFdnfYx1+C8b0AnBnPxAsaKCNtDfCtpuUHOr+_kALzQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jun 13, 2020 at 11:28 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> [PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper.

Makes a lot of sense. If we do this, I'll need to update pgbench,
which just did something similar locally. If I'd been paying
attention to this thread I might not have committed that piece of the
recent pgbench changes, but it's trivial stuff and I'll be happy to
tidy that up when the time comes.
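
To make the idea in the first patch concrete, here is a minimal sketch of a time value stored as a single 64-bit nanosecond count, so that subtraction becomes one integer operation. The names ns_time, ns_time_now, ns_time_diff and ns_time_to_ms are hypothetical and are not taken from the patch or from instr_time.h; the sketch also assumes a POSIX system with clock_gettime().

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for instr_time: one signed 64-bit nanosecond count. */
typedef struct
{
    int64_t ns;
} ns_time;

/* Read the monotonic clock and flatten struct timespec into nanoseconds. */
static inline ns_time
ns_time_now(void)
{
    struct timespec ts;
    ns_time t;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    t.ns = (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
    return t;
}

/* Elapsed nanoseconds: a single subtraction, no tv_sec/tv_nsec carry logic. */
static inline int64_t
ns_time_diff(ns_time end, ns_time start)
{
    return end.ns - start.ns;
}

/* Convert only when reporting, not on every measurement. */
static inline double
ns_time_to_ms(int64_t ns)
{
    return ns / 1e6;
}
```

The point of the layout is that the per-tuple path only ever does the subtraction; the division into human-readable units happens once, when the result is printed.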

> [PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation.

> Some of the time is spent doing function calls, dividing into struct
> timespec, etc. But most of it is just the rdtscp instruction:
> 65.30 │1 63: rdtscp

> The reason for that is largely that rdtscp waits until all prior
> instructions have finished (but it allows later instructions to already
> start). Multiple times for each tuple.

Yeah, after reading a bit about this, I agree that there is no reason
to think that the stalling version makes the answer better in any way.
It might make sense if you use it once at the beginning of a large
computation, but it makes no sense if you sprinkle it around inside
blocks that will run multiple times. It destroys your
instructions-per-cycle, turning your fancy superscalar Pentium
into a 486. It does raise some interesting questions about what
exactly you're measuring, though: I don't know enough to have a good
grip on how far out of order the TSC could be read!
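
For readers who want to see the two variants side by side, here is a rough sketch using the GCC/Clang intrinsics __rdtsc() and __rdtscp() from <x86intrin.h> (MSVC provides the same intrinsics via <intrin.h>). The function names ticks_unordered, ticks_ordered and fallback_ticks are made up for illustration, and the clock_gettime() fallback for non-x86 builds is my assumption, not something proposed in the patch.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>          /* GCC/Clang: __rdtsc(), __rdtscp() */
#define HAVE_X86_TSC 1
#endif

/* Assumed fallback for other architectures: nanoseconds from the OS clock. */
static inline uint64_t
fallback_ticks(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
}

/* Plain rdtsc: does not wait for earlier instructions to finish. */
static inline uint64_t
ticks_unordered(void)
{
#ifdef HAVE_X86_TSC
    return __rdtsc();
#else
    return fallback_ticks();
#endif
}

/*
 * rdtscp: waits until all prior instructions have executed before reading
 * the counter. Requires a CPU that supports the instruction.
 */
static inline uint64_t
ticks_ordered(void)
{
#ifdef HAVE_X86_TSC
    unsigned int aux;           /* receives IA32_TSC_AUX, ignored here */

    return __rdtscp(&aux);
#else
    return fallback_ticks();
#endif
}
```

Bracketing a hot inner loop with ticks_ordered() is the pattern the quoted profile complains about; ticks_unordered() keeps the pipeline flowing, at the price of being less precise about which instructions are included in the interval.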

> There's also other issues with using rdtsc directly: On older CPUs, in
> particular older multi-socket systems, the tsc will not be synchronized
> in detail across cores. There's bits that'd let us check whether tsc is
> suitable or not. The more current issue of that is that things like
> virtual machines being migrated can lead to rdtsc suddenly returning a
> different value / the frequency differing. But that is supposed to be
> solved these days, by having virtualization technologies set frequency
> multipliers and offsets which then cause rdtsc[p] to return something
> meaningful, even after migration.

Googling tells me that Nehalem (2008) introduced "invariant TSC"
(clock rate independent) and also socket synchronisation at the same
time, so systems without it are already pretty long in the tooth.

A quick peek at an AMD manual[1] tells me that a similar change
happened in 15H/Bulldozer/Piledriver/Steamroller/Excavator (2011),
identified with the same CPUID test.
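
As far as I can tell from the vendor manuals, the CPUID test in question is leaf 0x80000007, EDX bit 8 ("invariant TSC"). A minimal sketch of that check, assuming GCC or Clang and their <cpuid.h> helper __get_cpuid(); the function name has_invariant_tsc is hypothetical:

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/Clang helper: __get_cpuid() */
#endif

/* Returns 1 if the CPU advertises an invariant TSC, 0 otherwise. */
static int
has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 if leaf 0x80000007 is not supported. */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 8) & 1;      /* EDX bit 8: invariant TSC */
#else
    return 0;                   /* no TSC on this architecture */
#endif
}
```

A caller would presumably run this once at startup and fall back to clock_gettime() when it returns 0. Note that a hypervisor can mask or synthesize this bit, so the result reflects what the guest is told rather than the physical hardware.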

My first reaction is that it seems like TSC would be the least of your
worries if you're measuring a VM that's currently migrating between
hosts, but maybe the idea is just that you have to make sure you don't
assume it can't ever go backwards or something like that...
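
One defensive policy along those lines, purely as an illustration and not something the patch does, is to clamp an apparently negative interval to zero instead of letting an unsigned subtraction wrap around to a huge value. The name tsc_elapsed is hypothetical:

```c
#include <stdint.h>

/*
 * Elapsed ticks between two counter readings. If the counter appears to
 * have gone backwards (end < start), report 0 rather than wrapping.
 */
static inline uint64_t
tsc_elapsed(uint64_t start, uint64_t end)
{
    return (end >= start) ? end - start : 0;
}
```

For example, tsc_elapsed(100, 250) yields 150, while tsc_elapsed(250, 100) yields 0 rather than a value close to 2^64.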

Google Benchmark[2] has some clues about how to spell this on MSVC,
which instructions might be worth researching on ARM, etc.

[1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf
(page 373)
[2] https://github.com/google/benchmark/blob/master/src/cycleclock.h
