On Mon, Oct 24, 2011 at 4:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I wonder how trustworthy the measure of the visibilitymap_test call site
>>> as a consumer of cycles really is.
>> I'm not sure either. I guess we could try short-circuiting
>> visibilitymap_test and see what that does to performance (let's leave
>> correct answers out of it).
> That would conflate the cost of the call with the cost of the function.
> Maybe you could try manually inlining the visibility test?
I fooled around with this some more on my Fedora 14 desktop machine.
pgbench database, scale factor 20, shared_buffers=400MB. I ran the
query "select count(*) from pgbench_accounts". I'm having a hard time
getting completely reproducible results, but it appears that warming
the cache makes the sequential scan go faster drop from maybe 390 ms
to 245 ms, and an index-only scan takes about 350 ms, so which one is
better depends a lot on your assumptions about what is going on on the
system at the same time (which means maybe we ought not to sweat it).
If I wipe out the whole if-block that calls visibilitymap_test(), the
index-only scan drops down to about 300 ms (and delivers potentially
wrong answers, of course). Inlining visibilitymap_test (see attached
vismap-inline.patch) causes the index-only scan to drop to about 330
I also tried changing the BufferIsValid() tests in
visibilitymap_test() to use BufferIsInvalid() instead, with the sense
of the tests reversed (see attached vismap-test-invalid.patch). Since
BufferIsInvalid() just checks for InvalidBuffer instead of also doing
the sanity checks, it's significantly cheaper. This also reduced the
time to about 330 ms, so seems clearly worth doing. Apparently these
changes don't stack, because doing both things only gets me down to
about 320 ms, which is fairly unexciting for the amount of ugliness
that inlining entails.
I tried sprinkling a little branch-prediction magic on this code using
GCC's __builtin_expect(). That initially seemed to help, but once I
changed the BufferIsValid() test to !BufferIsInvalid() essentially all
of the savings disappeared.
I also spent some time poking through the opannotate -s -a output,
which shows where time is being spent by individual assembler
instruction, but also annotates the assembler listing with the
original source code. At least according to oprofile, the time that
is being spent in IndexOnlyNext() is mostly being spent on seemingly
innocuous operations like saving and restoring registers. For
example, much of the time being attributed to the visibilitymap_test()
call in IndexOnlyNext() is actually attributable to the instructions
that are calculating what address to pass for scandesc->heapRelation.
Many but not all of the pointer deferences at the top of
IndexOnlyNext() have a chunk cycles attributed to them, and while
they're not that significant individually, they add up. Similarly,
the long series of instructions to which index_getattr() resolves
bleeds cycles at just about every step. There's not much to optimize
there, though, unless we want to add some code that avoids decoding
the tuple altogether in the particular case of a zero-argument
aggregate, or maybe something more general that only pulls out the
columns that are actually needed.
The Enterprise PostgreSQL Company
In response to
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2011-10-28 18:33:25|
|Subject: Re: ecpg-related build failure with make 3.82|
|Previous:||From: Tom Lane||Date: 2011-10-28 17:24:31|
|Subject: Re: ecpg-related build failure with make 3.82 |