Re: Sum aggregate calculation for single precision real

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sum aggregate calculation for single precision real
Date: 2017-02-15 07:40:30
Message-ID: 903a8b6f-8b96-1cfa-e8a3-4656a87eb89c@postgrespro.ru
Lists: pgsql-hackers

On 14.02.2017 16:59, Jim Nasby wrote:
> On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:
>> It is not true - please notice the query execution time of these two queries:
>
> I bet you'd get even less difference if you simply cast to float8
> instead of adding 0.0. Same result, no floating point addition.
>
>>> The expectation for SUM(float4) is that you want speed and are
>>> prepared to cope with the consequences. It's easy enough to cast your
>>> input to float8 if you want a wider accumulator, or to numeric if
>>> you'd like more stable (not necessarily more accurate :-() results.
>>> I do not think it's the database's job to make those choices for you.
>>
>> From my point of view it is a strange and wrong expectation.
>> I am choosing "float4" type for a column just because it is enough to
>> represent range of data I have and I need to minimize size of record.
>
> In other words, you've decided to trade accuracy for performance...

I cannot agree with this...
1. If I choose the float4 type to store a bid price (which usually has 5-6
significant digits), I do not lose precision and accuracy does not suffer.
Accuracy matters when I am calculating the sum of prices. But the
assumption that the accuracy of the sum calculation should depend on the
type of the summed field is not obvious. It may be more or less clear to
C programmers, but not to SQL users.
In every database I have tested, SUM of single-precision floats is
calculated using at least double precision (or using the numeric type).
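The effect of the accumulator width is easy to reproduce outside the database. Below is a minimal Python sketch (not PostgreSQL code) that simulates a float4 accumulator by rounding the running state after every addition; the constant 0.1 and the row count are arbitrary illustrations:

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE double) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

val = to_f32(0.1)   # a hypothetical float4 column value
n = 1_000_000

acc32 = 0.0         # simulates float4pl: the state is rounded after every step
acc64 = 0.0         # simulates a float8 accumulator
for _ in range(n):
    acc32 = to_f32(acc32 + val)
    acc64 += val

true_sum = n * val
print("float4 accumulator:", acc32, "error:", abs(acc32 - true_sum))
print("float8 accumulator:", acc64, "error:", abs(acc64 - true_sum))
```

With a single-precision accumulator the rounding error grows with the running total (the ULP of the accumulator grows as the sum grows), while the double-precision accumulator stays within a tiny fraction of a unit.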

2. There is no huge gap in performance between accumulating in float4
and float8 - certainly not "orders of magnitude":
postgres=# select sum(l_quantity) from lineitem_projection;
sum
-------------
1.07374e+09
(1 row)

Time: 4659.509 ms (00:04.660)

postgres=# select sum(l_quantity::float8) from lineitem_projection;
sum
------------
1529738036
(1 row)

Time: 5465.320 ms (00:05.465)

So I do not think there is actually a compromise here between
performance and accuracy.
Meanwhile, the current implementation leads to many confusions and
contradictions with users' expectations:

1. sum(l_quantity) and sum(l_quantity::float8) are completely different
(by a factor of 1.5! - we lose half a billion dollars :)
2. avg(l_quantity)*count(l_quantity) is not equal to sum(l_quantity),
but after casting to float8 the results match.
3. The sum of per-group aggregates is not equal to the total sum (once
again, no problem for the float8 type).
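The third point can be reproduced in a short Python sketch (not PostgreSQL code) that simulates a float4 accumulator by rounding after every addition; the constant value and the group sizes are made up for illustration:

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE double) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

val = to_f32(0.1)
n_groups, group_size = 10, 100_000

# Per-group SUM(float4), then combining the group results (still in float4)
group_sums = []
for _ in range(n_groups):
    g = 0.0
    for _ in range(group_size):
        g = to_f32(g + val)
    group_sums.append(g)
combined = 0.0
for g in group_sums:
    combined = to_f32(combined + g)

# Single-pass SUM(float4) over all rows
total = 0.0
for _ in range(n_groups * group_size):
    total = to_f32(total + val)

print(combined, total)  # the two float4 results disagree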

>> But when I am calculating sum, I expect to receive more or less precise
>> result. Certainly I realize that even in case of using double it is
>
> ... but now you want to trade performance for accuracy? Why would you
> expect the database to magically come to that conclusion?

See above - there is no trade-off here. Please notice that the current
Postgres implementation of the AVG aggregate calculates both the sum and
the sum of squares, even though the latter is not needed for AVG.
The comment in the code says:

* It might seem attractive to optimize this by having multiple accumulator
* functions that only calculate the sums actually needed. But on most
* modern machines, a couple of extra floating-point multiplies will be
* insignificant compared to the other per-tuple overhead, so I've chosen
* to minimize code space instead.

And it is true!
In addition to the results above, here is the timing for the AVG
calculation:

postgres=# select avg(l_quantity) from lineitem_projection;
avg
------------------
25.5015621964919
(1 row)

postgres=# select avg(l_quantity::float8) from lineitem_projection;
avg
------------------
25.5015621964919
(1 row)

Time: 6103.807 ms (00:06.104)

Please notice that AVG for float4 is calculated using float4_accum,
which uses a float8 accumulator and also calculates sumX2!
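Per the quoted source comment, the transition state behind AVG(float4) keeps N, sum(X) and sum(X*X) together in double precision. A rough Python analogue of that transition function (not the actual C code; the function names are mine):

```python
# Sketch of a float4_accum/float8_accum-style transition state:
# (N, sum(X), sum(X*X)), all kept in double precision regardless
# of the input column type.
def accum(state, x):
    n, sum_x, sum_x2 = state
    x = float(x)  # widen the float4 input to float8 before accumulating
    return (n + 1, sum_x + x, sum_x2 + x * x)

def avg_final(state):
    n, sum_x, _ = state
    return sum_x / n if n else None

state = (0, 0.0, 0.0)
for q in [10.0, 20.0, 30.0, 40.0]:
    state = accum(state, q)
print(avg_final(state))  # prints 25.0
```

sum(X*X) is carried along even though only AVG is requested, exactly as the quoted comment explains: the extra multiply is cheap compared with the per-tuple overhead.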

So I do not see any reasonable argument for using float4pl for
sum(float4)!
And I do not know of any other database with such strange behavior.
I know that "be like others", or especially "be like Oracle", is never a
good argument in the Postgres community, but doing something differently
(and IMHO wrongly) without any significant reason seems very strange.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
