cost_sort() improvements

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: cost_sort() improvements
Date: 2018-06-28 16:47:22
Message-ID: 803ccdce-39ef-f1c3-3c65-c79959f7081d@sigaev.ru

Hi!

The current estimation of sort cost has the following issues:
- it doesn't distinguish between single-column and multi-column sorts
- it doesn't take the comparison function cost or the column width into account
- it doesn't try to count the number of comparison function calls on a
per-column basis

At least [1] and [2] run into these issues and have raised objections/questions
about the correctness of the sort cost estimation. The suggested patch tries to
improve the current estimation and to solve these issues.

Basic ideas:
- let me introduce some designations:
    i   - index of the column, starting with 1
    N   - number of tuples to sort
    Gi  - number of groups formed by the first i columns. Obviously,
          G0 == 1. The number of groups is calculated by estimate_num_groups().
    NGi - number of tuples in one group, NG0 == N and NGi = N/G(i-1)
    Fi  - cost of a comparison function call for the i-th column
    Wi  - average width in bytes of the i-th column
    Ci  - total cost of one comparison call for the i-th column, calculated as
          Fi * fine_function(Wi), where fine_function(Wi) looks like:
              if (Wi <= sizeof(Datum))
                  return 1.0; // any type passed by value in a Datum
              else
                  return 1 + log16(Wi / sizeof(Datum));
          log16 is just an empirical choice
- So, the cost for the first column is 2.0 * C1 * log2(N).
  The first column is always compared; the 2.0 multiplier was chosen based on
  the regression tests. It seems to account for the cost of transferring the
  tuple to the CPU cache, the startup cost of unpacking the tuple, the cost of
  calling the qsort compare wrapper function, etc. Removing this multiplier
  results in too optimistic cost predictions.
- The cost for the i-th column is:
      Ci * log2(NGi) => Ci * log2(N/G(i-1))
  Here I assume that the comparison function of the i-th column is called only
  inside a group whose size is determined by the (i-1) leading columns. Note,
  following the discussion in [1], I add a 1.5 multiplier here to be closer to
  the worst case, where group sizes differ significantly; right now there is no
  way to determine how much they could differ. Note also that this formula
  describes the cost of the first column too, apart from the different
  multiplier.
- The final cost is cpu_operator_cost * N * sum(per-column costs described
  above); a rough code sketch of the whole formula is given after this list.
  Note, for a single column with width <= sizeof(Datum) and F1 = 1 this formula
  gives exactly the same result as the current one.
- For a Top-N sort the heuristic stays close to the old one: use the 2.0
  multiplier as a constant under log2, and use log2(Min(NGi, output_tuples))
  for the second and following columns.
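
To make the arithmetic above concrete, here is a minimal standalone sketch in C
of the per-column formula. It is not the patch itself: the helper names
(fine_function, sort_comparison_cost) and the array-based interface are mine
for illustration only; the real patch works on the path keys inside cost_sort().

    /*
     * Standalone sketch of the per-column sort cost formula described above.
     * F[] and W[] are 1-based (index 0 unused); G[] is 0-based with G[0] == 1.
     */
    #include <math.h>

    #define DATUM_SIZE 8.0      /* assuming a 64-bit Datum */

    /* width penalty: 1.0 for by-value widths, growing as log16 otherwise */
    static double
    fine_function(double width)
    {
        if (width <= DATUM_SIZE)
            return 1.0;
        return 1.0 + log(width / DATUM_SIZE) / log(16.0);
    }

    static double
    sort_comparison_cost(int ncols, const double *F, const double *W,
                         const double *G, double N, double cpu_operator_cost)
    {
        double      per_tuple = 0.0;

        for (int i = 1; i <= ncols; i++)
        {
            double      Ci = F[i] * fine_function(W[i]);
            double      NGi = N / G[i - 1];     /* tuples per group of leading columns */

            if (i == 1)
                per_tuple += 2.0 * Ci * log2(N);    /* first column: always compared */
            else
                per_tuple += 1.5 * Ci * log2(NGi);  /* i-th column: compared inside groups */
        }

        /*
         * For ncols == 1, F[1] == 1 and W[1] <= sizeof(Datum) this reduces to
         * 2.0 * cpu_operator_cost * N * log2(N), i.e. the current estimate.
         * The Top-N variant (log2(Min(NGi, output_tuples))) is not shown here.
         */
        return cpu_operator_cost * N * per_tuple;
    }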

I believe the suggested method could easily be enhanced to support [1] and be
used in [2].

Open items:
- Fake Vars. I ran into generate_append_tlist(), which generates Vars with
  varno = 0; as far as I can see it was invented back in 2002 with the comment
  'should be changed in future'. That future hasn't come yet... I've added a
  workaround to prevent calling estimate_num_groups() with such Vars (see the
  sketch after this list), but I'm not sure that is enough.
- The costs of all comparison functions are currently assumed to be the same
  and equal to 1. Maybe it's time to change that.
- Empirical constants. I know it's impossible to remove them entirely, but
  maybe we need to find better estimates for them.
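
For the fake-Var workaround, what I have in mind is roughly the following (only
a sketch, not the patch: sort_exprs and nGroups are placeholder names, root and
tuples come from the surrounding cost_sort() context, and whether falling back
to a single group is the right default is exactly the open question):

    /*
     * Sketch: skip "fake" Vars (varno == 0, as produced by
     * generate_append_tlist()) before calling estimate_num_groups(),
     * since there are no statistics behind them.
     */
    List       *exprs = NIL;
    ListCell   *lc;

    foreach(lc, sort_exprs)         /* sort_exprs: the per-column sort expressions */
    {
        Node       *expr = (Node *) lfirst(lc);

        if (IsA(expr, Var) && ((Var *) expr)->varno == 0)
            continue;
        exprs = lappend(exprs, expr);
    }

    if (exprs != NIL)
        nGroups = estimate_num_groups(root, exprs, tuples, NULL);
    else
        nGroups = 1.0;              /* worst case: assume one big group */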

[1] https://commitfest.postgresql.org/18/1124/
[2] https://commitfest.postgresql.org/18/1651/

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

Attachment: cost_sort-v6.patch (text/x-patch, 9.5 KB)
