Re: Function execution costs 'n all that

From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Function execution costs 'n all that
Date: 2007-01-29 02:50:19
Message-ID: 45BD60EB.2060106@markdilger.com
Lists: pgsql-hackers

Tom Lane wrote:
> Mark Dilger <pgsql(at)markdilger(dot)com> writes:
>> Tom Lane wrote:
>>> Would a simple constant value be workable, or do we need some more
>>> complex model (and if so what)?
>
>> Consider:
>> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
>> ...
>> It seems to me that the above system would work perfectly well for
>> collecting the number of rows returned from a set returning function,
>> not just the run times.
>
> I don't really think that data collection is the bottleneck.

Ahh, I'm not just thinking about data collection. I'm thinking about usability
for non-hackers who know enough plpgsql to write a function and then want to
train the system to plan for it appropriately. It's much easier for a novice
user to say "go away and figure out how expensive this thing is" than to reason
about things like statistical variance. We don't demand that kind of knowledge
from users who write queries or run ANALYZE on tables, so why would they need
it to write a function?

> If a
> constant estimate isn't good enough for you, then you need some kind of
> model of how the runtime or number of rows varies with the function's
> inputs ... and I hardly see how something like the above is likely to
> figure out how to fit a good model. Or at least, if you think it can,
> then you skipped all the interesting bits.

I am (perhaps naively) imagining that the user will train the database over the
same query that will actually be used most often in production. If the query
modifies the table, the user could train the database over a copy of that
table. The data collected by the analyze phase would just be constants such as
the average runtime and its standard deviation. That would make the job of the
planner / cost estimator easier, right? It could treat the function as a
constant-cost function.
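
To make that concrete, here is a rough sketch of what such a training pass
might record per function. The table and column names below are made up for
illustration; nothing like this exists today:

    -- Hypothetical catalog of per-function constants gathered by training.
    CREATE TABLE pg_proc_cost_stats (
        procoid     oid PRIMARY KEY,  -- the trained function
        avg_cost_ms float8,           -- average runtime per call
        stddev_ms   float8,           -- spread observed during training
        avg_rows    float8            -- average rows returned, for SRFs
    );

    -- The planner would then look the function up and treat it as a
    -- constant-cost (and, for SRFs, constant-rowcount) function:
    SELECT avg_cost_ms, avg_rows
      FROM pg_proc_cost_stats
     WHERE procoid = 'myfunc(integer)'::regprocedure;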

> One other point is that we already know that sampling overhead and
> measurement error are significant problems when trying to measure
> intervals on the order of one Plan-node execution. I'm afraid that
> would get a great deal worse if we try to use a similar approach to
> timing individual function calls.

The query could be run with the arguments passed to "myfunc" recorded to a
temporary table. Once the query completes (and the temporary table is
populated), the data could be pulled into memory in batches and "myfunc" run on
each batch again in a tight loop. The loop itself could be timed, rather than
each iteration, which avoids the per-call measurement overhead you mention. The
sum of the timings across all batches would be the total runtime, which divided
by the total number of rows gives the average runtime per call. The downside is
that I don't see how you retrieve the standard deviation. (I also don't know
whether the planner can use standard deviation information, so perhaps this is
a non-issue.)
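
Something along these lines, in plpgsql, is what I have in mind for the replay
step. The names myfunc_args and time_myfunc_replay are placeholders, myfunc is
assumed to take and return an integer, and this version just walks the whole
temp table rather than batching it, but it shows the point: only the loop as a
whole is timed.

    CREATE OR REPLACE FUNCTION time_myfunc_replay() RETURNS float8
    LANGUAGE plpgsql AS $$
    DECLARE
        t0    timestamptz;
        n     bigint;
        r     record;
        dummy integer;
    BEGIN
        SELECT count(*) INTO n FROM myfunc_args;  -- args captured during the query
        IF n = 0 THEN
            RETURN NULL;
        END IF;
        t0 := clock_timestamp();
        FOR r IN SELECT arg FROM myfunc_args LOOP
            dummy := myfunc(r.arg);               -- re-run the function on each saved input
        END LOOP;
        -- time the loop as a whole, then average over the number of calls
        RETURN extract(epoch FROM clock_timestamp() - t0) * 1000.0 / n;
    END;
    $$;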

A further refinement would be to batch the inputs based on properties of the
input data. For text, you could run a batch of short texts first, a batch of
medium-length texts second, and a batch of long texts last, then use a
least-squares linear fit to model runtime cost as a function of input text
length. I'm not sure how such a refinement would be done for fixed-size
datatypes. And for some text functions the runtime won't vary with length but
with some other property anyway.
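
If the per-batch measurements were kept in a table, say
myfunc_timings(input_len, runtime_ms) (again, a hypothetical name), the
built-in regression aggregates would already give that fit:

    SELECT regr_slope(runtime_ms, input_len)     AS ms_per_character,
           regr_intercept(runtime_ms, input_len) AS fixed_ms_per_call
      FROM myfunc_timings;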

mark
