dg@illustra.com (David Gould) writes:
> [ Much careful testing snipped ]
Nice job, David!
> gprof introduces severe experimental error for small quick functions.
> Perhaps the spinlock function is so short and quick compared to the
> mcount overhead added to the function prolog that the overhead dominates
> the measurement. gprof remains a useful tool for larger functions with
> longer runtimes, but must be considered very suspect for tiny functions.
Right. gprof is a fine example of Heisenberg's Uncertainty Principle
applied to software ;-). You can't measure something without affecting it.
As Bruce Momjian just pointed out in another followup, running the test
function a lot of times in a tight loop isn't a perfect answer either:
you find out what happens when the function's code and referenced data are
all in cache, but you can't tell much about what happens when they are
not; and you can't tell whether the whole application's memory usage
patterns are such that the function will remain in cache.
I'm guessing that the backend uses tas() enough that it will probably
stay in cache, but that is *strictly* a guess with no evidence.
> In some of the test cases there was significant timing variation from
> run to run even though the test conditions were apparently identical.
> Even more strangely, the variation in time was not random but appeared
> to represent two different modes. And, the variation was itself
> I have no explanation for this variation. Possibly it is some interaction
> of where the program is loaded and the state of the memory hierarchy, but
> even this is hard to sustain. I would be very curious to hear of any
> plausible explanations.
After chewing on this for a while I think that your speculation is
right. You were using a Pentium, you said. The Pentium has a two-way
set associative cache, which means that any given main-memory address
has exactly two cache lines it could be loaded into. Main-memory
addresses that are 1000H apart contend for the same pair of cache lines.
Thus, if your program happens to repeatedly hit three locations that are
exactly 1000H apart, it will suffer a cache miss every time. Change the
address spacing, and no miss occurs. The cache miss takes forty-some
clock cycles, versus one if the target location is in cache.
(BTW, I'm getting this info out of Rick Booth's "Inner Loops", a fine
reference book if you are into hand-optimized assembly coding for Intel
processors.)
So what I think you were seeing is that on some runs, the loop involved
hitting three addresses that contend for the same cache line pair, while
on other runs there was no cache contention. This could be explained
by varying load addresses for the program, if it touched both its own
addresses (variable) and some non-varying addresses --- say, C library
routines executed from a shared library that remained loaded throughout.
If you can link with a non-shared C library then you should get more
consistent runtimes, because the offsets between all the locations
touched by your loop would be fixed, and thus cache hits or misses ought
to be consistent from run to run.
The bottom line, however, is that this behavior is too unpredictable
to be worth worrying about in a production program. A cache miss in
a tight loop could be created or eliminated by unrelated changes in
distant parts of the code ... and in any case the behavior will be
different on different CPUs. The 486, Pentium, and Pentium Pro all
have radically different cache layouts, let alone non-Intel CPUs.
regards, tom lane