Re: Inlining comparators as a performance optimisation

From: "Pierre C" <lists(at)peufeu(dot)com>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Peter Geoghegan" <peter(at)2ndquadrant(dot)com>, "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Inlining comparators as a performance optimisation
Date: 2012-01-13 09:48:56
Message-ID: op.v70n7uhgeorkce@apollo13
Lists: pgsql-hackers

On Wed, 21 Sep 2011 18:13:07 +0200, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 21.09.2011 18:46, Tom Lane wrote:
>>> The idea that I was toying with was to allow the regular SQL-callable
>>> comparison function to somehow return a function pointer to the
>>> alternate comparison function,
>
>> You could have a new function with a pg_proc entry, that just returns a
>> function pointer to the qsort-callback.
>
> Yeah, possibly. That would be a much more invasive change, but cleaner
> in some sense. I'm not really prepared to do all the legwork involved
> in that just to get to a performance-testable patch though.

A few years ago I looked for a way to speed up COPY operations, and it
turned out that COPY TO has a good optimization opportunity. At that time,
for each datum, COPY TO would:

- test for nullness
- call an outfunc through fmgr
- the outfunc palloc()s a bytea or text, fills it with data, and returns
  it (sometimes it uses an extensible string buffer which may be
  repalloc()d several times)
- COPY memcpy()s the returned data to a buffer and eventually flushes the
  buffer to the client socket.

I introduced a special write buffer with an on-flush callback (i.e., a
close relative of the existing string buffer); in this case the callback
was "flush to client socket". I also added several outfuncs (one per type)
which took that buffer as an argument, in addition to the datum to output,
and simply put the datum into the buffer, with appropriate transformations
(like converting to bytea or text), flushing if needed.

Then the COPY TO BINARY of a constant-size datum would turn into:
- one test for nullness
- one C function call
- one test to ensure appropriate space available in buffer (flush if
  needed)
- one htonl() and memcpy of constant size, which the compiler turns
  into a couple of simple instructions
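
To make the steps above concrete, here is a minimal sketch of such a
buffer and one direct outfunc. It is not the code from my old patch: all
names (WriteBuf, wb_put_int32, int4_send_direct, Sink) are made up for
illustration, a plain array stands in for the client socket, and I assume
the COPY BINARY convention of a 4-byte length word before each field.

```c
#include <arpa/inet.h>   /* htonl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical write buffer; not PostgreSQL's StringInfo. */
typedef struct WriteBuf WriteBuf;
typedef void (*flush_cb)(WriteBuf *buf);

struct WriteBuf {
    char    *data;
    size_t   len;
    size_t   cap;
    flush_cb on_full;    /* must leave enough free space when it returns */
    void    *arg;        /* opaque state for the callback */
};

/* One space test per write; the callback runs only when the buffer is full. */
static inline void wb_reserve(WriteBuf *b, size_t n)
{
    if (b->cap - b->len < n)
        b->on_full(b);
    assert(b->cap - b->len >= n);
}

/* Constant-size htonl + memcpy: once inlined, a compiler can reduce this
 * to a byte swap and a store. */
static inline void wb_put_int32(WriteBuf *b, int32_t v)
{
    uint32_t net = htonl((uint32_t) v);

    wb_reserve(b, sizeof net);
    memcpy(b->data + b->len, &net, sizeof net);
    b->len += sizeof net;
}

/* Direct outfunc for a 4-byte integer: length word, then the value. */
static void int4_send_direct(WriteBuf *b, int32_t datum)
{
    wb_put_int32(b, (int32_t) sizeof datum);
    wb_put_int32(b, datum);
}

/* Stand-in for "flush to client socket": append to an in-memory sink. */
typedef struct { char bytes[64]; size_t used; } Sink;

static void flush_to_sink(WriteBuf *b)
{
    Sink *s = b->arg;

    assert(s->used + b->len <= sizeof s->bytes);
    memcpy(s->bytes + s->used, b->data, b->len);
    s->used += b->len;
    b->len = 0;
}
```

The caller only does the null test and one call to int4_send_direct();
no palloc and no intermediate copy are involved.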

I recall measuring speedups of 2x - 8x on COPY BINARY, less for text, but
still large gains.

Although eliminating fmgr call and palloc overhead was an important part
of it, another large part was getting rid of memcpy()'s which the compiler
turned into simple movs for known-size types, a transformation that can be
done only if the buffer write functions are inlined inside the outfuncs.
Compilers love constants...

Additionally, code size growth was minimal since I moved the old outfuncs
code into the new outfuncs, and replaced the old fmgr-callable outfuncs
with "create buffer with on-full callback=extend_and_repalloc() - pass to
new outfunc(buffer,datum) - return buffer". Which is basically equivalent
to the previous palloc()-based code, maybe with a few extra instructions.
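
A rough sketch of that wrapper arrangement, under loud assumptions:
malloc/realloc stand in for palloc/repalloc, the names (OutBuf, ob_write,
int4_out_direct, int4_out_legacy) are hypothetical, and the legacy
function returns a plain C string instead of going through fmgr.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output buffer with an on-full callback. */
typedef struct OutBuf OutBuf;
struct OutBuf {
    char  *data;
    size_t len;
    size_t cap;
    void (*on_full)(OutBuf *b, size_t need);
};

static inline void ob_write(OutBuf *b, const void *src, size_t n)
{
    if (b->cap - b->len < n)
        b->on_full(b, n);
    assert(b->cap - b->len >= n);
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

/* New-style outfunc: all the formatting logic lives here, and it writes
 * into whatever buffer the caller supplies. */
static void int4_out_direct(OutBuf *b, int32_t v)
{
    char tmp[12];                      /* "-2147483648" plus NUL */
    int  n = snprintf(tmp, sizeof tmp, "%d", (int) v);

    ob_write(b, tmp, (size_t) n);
}

/* On-full callback for the legacy path: grow the allocation
 * (realloc here, repalloc in the real thing). */
static void extend_and_realloc(OutBuf *b, size_t need)
{
    size_t newcap = b->cap ? b->cap * 2 : 16;
    char  *p;

    while (newcap - b->len < need)
        newcap *= 2;
    p = realloc(b->data, newcap);
    if (p == NULL)
        abort();
    b->data = p;
    b->cap = newcap;
}

/* Legacy-shaped function: create a growable buffer, pass it to the new
 * outfunc, return the buffer contents. The caller frees the result. */
static char *int4_out_legacy(int32_t v)
{
    OutBuf b = { NULL, 0, 0, extend_and_realloc };

    int4_out_direct(&b, v);
    ob_write(&b, "", 1);               /* terminating NUL */
    return b.data;
}
```

Only the callback differs between the two paths, so the formatting code
exists once.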

When I submitted the patch for review, Tom rightfully pointed out that my
way of obtaining the C function pointer sucked very badly (I don't
remember how I did it, only that it was butt-ugly) but the idea was to get
a quick measurement of what could be gained, and the result was positive.
Unfortunately I had no time available to finish it and make it into a real
patch. I'm sorry about that.

So why do I post in this sorting thread? It seems that by bypassing fmgr
for functions which are small, simple, and called lots of times, there is
a large gain to be made, not only because of fmgr overhead but also because
of the opportunity for new compiler optimizations, palloc removal, etc.
However, in my experiment the arguments and return types of the new
functions were DIFFERENT from the old functions: the new ones do the same
thing, but in a different manner. One manner was suited to SQL-callable
functions (i.e., palloc and return a bytea) and another one to writing
large amounts of data (direct buffer write). Since both have very different
requirements, being fast at both is impossible for the same function.

Anyway, all that rant boils down to:

Some functions could benefit from having two versions (while sharing
almost all the code between them):
- User-callable (fmgr) version (current one)
- C-callable version, usually with different parameters and return type

And it would be cool to have a way to grab a bare function pointer to the
second one.

Maybe an extra column in pg_proc would do (but then, the proargtypes and
friends would describe only the SQL-callable version)?
Or an extra table? pg_cproc?
Or an in-memory hash: hashtable[ fmgr-callable function ] => C version
- What happens if a C function has no SQL-callable equivalent?
Or (ugly) introduce an extra per-type function type_get_function_ptr(
function_kind ) which returns the requested function ptr
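
To illustrate the in-memory lookup variant, here is a small sketch. It is
only one possible shape: a linear table stands in for the hash, the OID
value and every name (CFuncKind, cfunc_lookup, int4_cmp_raw, ...) are
invented for the example, and nothing like this exists in the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned int Oid;            /* stand-in for PostgreSQL's Oid */
typedef void (*generic_fn)(void);    /* carrier type; cast back before calling */

/* Which C-callable flavour is being requested (hypothetical kinds). */
typedef enum { CFN_SORT_COMPARE, CFN_SEND_DIRECT } CFuncKind;

typedef struct {
    Oid        sql_func;             /* the SQL-callable (fmgr) function */
    CFuncKind  kind;
    generic_fn fn;                   /* the bare C version */
} CFuncEntry;

/* C-callable comparator: different signature from the fmgr one. */
typedef int (*int4_cmp_fn)(int32_t, int32_t);

static int int4_cmp_raw(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}

#define EXAMPLE_INT4CMP_OID 90001u   /* made-up value, not a real pg_proc OID */

static const CFuncEntry cfunc_table[] = {
    { EXAMPLE_INT4CMP_OID, CFN_SORT_COMPARE, (generic_fn) int4_cmp_raw },
};

/* Returns NULL when no C version is registered, so callers can fall back
 * to the ordinary fmgr path. */
static generic_fn cfunc_lookup(Oid sql_func, CFuncKind kind)
{
    size_t i;

    for (i = 0; i < sizeof cfunc_table / sizeof cfunc_table[0]; i++)
        if (cfunc_table[i].sql_func == sql_func && cfunc_table[i].kind == kind)
            return cfunc_table[i].fn;
    return NULL;
}

/* Example caller: fetch the pointer once, then call it directly. */
static int compare_via_registry(Oid sql_func, int32_t a, int32_t b)
{
    int4_cmp_fn cmp = (int4_cmp_fn) cfunc_lookup(sql_func, CFN_SORT_COMPARE);

    assert(cmp != NULL);
    return cmp(a, b);
}
```

Being keyed on the SQL-callable function, this shape has no slot for a C
function without an SQL-callable equivalent, which is the open question
noted in the list above.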

If one of those happens, I'll dust off my old copy-optimization patch ;)

Hmm... just my 2c

Regards
Pierre
