Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: That EXPLAIN ANALYZE patch still needs work
Date: 2006-06-06 20:50:28
Message-ID: 23053.1149627028@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> The problem looks to be an underestimation of SampleOverhead, and on
> reflection it's clear why: what CalculateSampleOverhead is measuring
> isn't the total overhead, but the time between the two gettimeofday
> calls. Which is probably about half the true overhead.

On further thought, I take that back: the "true overhead" is not the
point here. The time elapsed during a plan node execution when
sampling can be broken down into three phases:
time before first gettimeofday call
time between gettimeofday calls
time after second gettimeofday call
Only the second part is actually measured by the instrumentation system;
the other parts are overhead that has never been counted by EXPLAIN
ANALYZE, sampling version or no. Moreover, most of the runtime of
InstrStartNode and InstrStopNode falls into the first or third parts.

What we would actually like to set SampleOverhead to is the portion
of the second-part runtime that doesn't occur when sampling = false.
Assuming that gettimeofday() has consistent runtime and the actual
time reported is measured at a consistent instant within that runtime,
I believe that we should take the SampleOverhead as just equal to
the runtime of a single gettimeofday() call. The added or removed
second-part time within InstrStartNode is just the tail time of
gettimeofday, and the added or removed second-part time within
InstrStopNode is basically just the head time of gettimeofday. (To make
this as true as possible, we need to change the order of operations so
that gettimeofday is invoked *immediately* after the "if (sampling)"
test, but that's easy.)

So this line of thought leads to the conclusion that
CalculateSampleOverhead is actually overestimating SampleOverhead
a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT().

But that still leaves me with a problem because my machine is clearly
overestimating the correction needed. I added some printouts and got

raw totaltime = 0.370937
per_iter = 0.000156913, SampleOverhead = 3.28e-06
adj totaltime = 1.82976
sampling = 0
starttime = 0/000000
counter = 0/370937
firsttuple = 0.258321
tuplecount = 10000
itercount = 10001
samplecount = 704
nextsample = 10011
startup = 0.258321
total = 1.82976
ntuples = 10000
nloops = 1

on a run with an actual elapsed time near 750 msec. Clearly the
sampling adjustment is wrong, but why?

I have a theory about this, and it's not pleasant at all. What I
think is that we have a Heisenberg problem here: the act of invoking
gettimeofday() actually changes what is measured. That is, the
runtime of the "second part" of ExecProcNode is actually longer when
we sample than when we don't, not merely due to the extra time spent
in gettimeofday(). It's not very hard to guess at reasons why, either.
The kernel entry is probably flushing some part of the CPU's state,
such as virtual/physical address mapping for the userland address
space. After returning from the kernel call, the time to reload
that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to
explain. One, that the effect is platform-specific: your machine
may avoid flushing as much state during a kernel call as mine does.
And two, that upper plan nodes seem much more affected than lower
ones. That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.

If this theory is correct, then the entire notion of EXPLAIN ANALYZE
sampling has just crashed and burned. We can't ship a measurement
tool that is accurate on some platforms and not others.

I'm wondering if it's at all practical to go over to a gprof-like
measurement method for taking EXPLAIN ANALYZE runtime measurements;
that is, take an interrupt every so often and bump the count for the
currently active plan node. This would spread the TLB-flush overhead
more evenly and thus avoid introducing that bias. There may be too much
platform dependency in this idea, though, and I also wonder if it'd
break the ability to do normal gprof profiling of the backend.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2006-06-06 21:01:53 Re: AIX check in datetime.h
Previous Message Jim C. Nasby 2006-06-06 20:39:26 Re: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE