Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date: 2020-06-13 03:53:01
Message-ID: CAFj8pRDhGSiSHWaiFU_5xqiTYQexhVuj+zbuO0JP-QTkES+VNQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, 13 Jun 2020 at 01:28, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
> the resulting timing enough that the times aren't meaningful. E.g.
>
> CREATE TABLE lotsarows(key int not null);
> INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> VACUUM FREEZE lotsarows;
>
>
> -- best of three:
> SELECT count(*) FROM lotsarows;
> Time: 1923.394 ms (00:01.923)
>
> -- best of three:
> EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
> Time: 2319.830 ms (00:02.320)
>
> -- best of three:
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 4202.649 ms (00:04.203)
>
> That's nearly *double* the execution time of the run without TIMING.
>
>
> Looking at a profile of this shows that we spend a good bit of cycles
> "normalizing" timestamps etc. That seems pretty unnecessary, just forced
> on us due to struct timespec. So the first attached patch just turns
> instr_time to be a 64bit integer, counting nanoseconds.
>
> That helps, a tiny bit:
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 4179.302 ms (00:04.179)
>
> but obviously doesn't move the needle.
>
>
> Looking at a profile it's easy to confirm that we spend a lot of time
> acquiring time:
> - 95.49% 0.00% postgres postgres [.]
> agg_retrieve_direct (inlined)
> - agg_retrieve_direct (inlined)
> - 79.27% fetch_input_tuple
> - ExecProcNode (inlined)
> - 75.72% ExecProcNodeInstr
> + 25.22% SeqNext
> - 21.74% InstrStopNode
> + 17.80% __GI___clock_gettime (inlined)
> - 21.44% InstrStartNode
> + 19.23% __GI___clock_gettime (inlined)
> + 4.06% ExecScan
> + 13.09% advance_aggregates (inlined)
> 1.06% MemoryContextReset
>
> And that's even though linux avoids a syscall (in most cases) etc to
> acquire the time. Unless the kernel detects there's a reason not to do
> so, linux does this by using 'rdtscp' and multiplying it by kernel
> provided factors to turn the cycles into time.
>
> Some of the time is spent doing function calls, dividing into struct
> timespec, etc. But most of it just the rdtscp instruction:
> 65.30 │1 63: rdtscp
>
>
> The reason for that is largely that rdtscp waits until all prior
> instructions have finished (but it allows later instructions to already
> start). Multiple times for each tuple.
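To make the difference concrete, this is how the two instructions are reached from C (a sketch using the GCC/Clang intrinsics; the non-x86 branch is a dummy so it compiles everywhere):

```c
/*
 * Sketch: plain rdtsc imposes no ordering against earlier instructions,
 * while rdtscp waits for all prior instructions to finish before reading
 * the counter -- the pipeline stall discussed above. x86-only; a dummy
 * fallback is provided for other architectures.
 */
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>

static void
read_tsc_pair(uint64_t *unordered, uint64_t *serialized)
{
    unsigned int aux;                  /* receives IA32_TSC_AUX */

    *unordered = __rdtsc();            /* may execute out of order */
    *serialized = __rdtscp(&aux);      /* waits for prior instructions */
}
#else
static void
read_tsc_pair(uint64_t *unordered, uint64_t *serialized)
{
    *unordered = 0;                    /* dummy values on non-x86 */
    *serialized = 1;
}
#endif
```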
>
>
> In the second attached prototype patch I've changed instr_time to count
> in cpu cycles instead of nanoseconds. And then just turned the cycles
> into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).
>
> When using rdtsc that results in *vastly* lower overhead:
>
>                                                            QUERY PLAN
> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>  Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)
>    ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1)
>  Planning Time: 0.028 ms
>  Execution Time: 2610.256 ms
> (4 rows)
>
> Time: 2610.589 ms (00:02.611)
>
> And there's still some smaller improvements that could be made ontop of
> that.
>
> As a comparison, here's the time when using rdtscp directly in
> instr_time, instead of going through clock_gettime:
> Time: 3481.162 ms (00:03.481)
>
> That shows pretty well how big the cost of the added pipeline stalls
> are, and how important out-of-order execution is for decent
> performance...
>
>
> In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
> do *not* want to wait for prior instructions to finish, since that
> actually leads to the timing being less accurate, rather than
> more. There are other cases where that'd be different, e.g. measuring
> how long an entire query takes or such (but there it's probably
> irrelevant which to use).
>
>
> I've above skipped a bit over the details of how to turn the cycles
> returned by rdtsc into time:
>
> On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
> have actually been run, but instead returns the number of 'reference
> cycles'. That's important because otherwise things like turbo mode and
> lower power modes would lead to completely bogus times.
>
> Thus, knowing the "base frequency" of the CPU allows us to turn the
> difference between two rdtsc return values into seconds.
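The conversion itself is trivial once the base frequency is known; something like this hypothetical helper (frequency given in MHz):

```c
#include <stdint.h>

/*
 * Turn a delta of TSC reference cycles into seconds, given the CPU base
 * frequency in MHz (e.g. 3600 for a 3.6 GHz part). Hypothetical helper,
 * not from the patch.
 */
static double
cycles_to_seconds(uint64_t cycle_delta, uint32_t base_mhz)
{
    return (double) cycle_delta / ((double) base_mhz * 1000000.0);
}
```

e.g. 7200000000 cycles at 3600 MHz come out as 2.0 seconds.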
>
> In the attached prototype I just determined the number of cycles using
> cpuid(0x16). That's only available since Skylake (I think). On older
> CPUs we'd have to look at /proc/cpuinfo or
> /sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
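The CPUID lookup could look roughly like this (a sketch: leaf 0x16 reports the base frequency in MHz in EAX; returning 0 signals that the leaf is missing and the /proc//sysfs fallback would be needed):

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Base frequency in MHz via CPUID leaf 0x16, or 0 if unavailable. */
static uint32_t
tsc_base_mhz(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0x16 only exists if the maximum supported leaf is >= 0x16. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx) || eax < 0x16)
        return 0;
    if (!__get_cpuid(0x16, &eax, &ebx, &ecx, &edx))
        return 0;
    return eax;        /* base MHz; 0 if the CPU doesn't report it */
}
#else
static uint32_t
tsc_base_mhz(void)
{
    return 0;          /* non-x86: caller must use another source */
}
#endif
```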
>
>
> There's also other issues with using rdtsc directly: On older CPUs, in
> particular older multi-socket systems, the tsc will not be synchronized
> in detail across cores. There's bits that'd let us check whether tsc is
> suitable or not. The more current issue is that things like virtual
> machines being migrated can lead to rdtsc suddenly returning a
> different value / the frequency differing. But that is supposed to be
> solved these days, by having virtualization technologies set frequency
> multipliers and offsets which then cause rdtsc[p] to return something
> meaningful, even after migration.
>
>
> The attached patches are really just a prototype. I'm also not really
> planning to work on getting this into a "production ready" patchset
> anytime soon. I developed it primarily because I found that the
> overhead made it too hard to nail down in which part of a query tree
> performance changed. If somebody else wants to continue from here...
>
> I do think it'd be a pretty significant improvement if we could reduce
> the timing overhead of EXPLAIN ANALYZE by this much. Even if it
> requires a bunch of low-level code.
>

+1

Pavel

> Greetings,
>
> Andres Freund
>
