Re: Built-in binning functions

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 19:55:10
Message-ID: 54037D9E.1080904@2ndquadrant.com
Lists: pgsql-hackers

On 30/08/14 19:24, Tom Lane wrote:
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
>> 1. I am thinking so reduction to only numeric types is not necessary -
>> although we can live without it - but there are lot of non numeric
>> categories: chars, date, ...
>
> I wasn't terribly happy about that either. I still think we should
> reduce this to a single polymorphic function, as in the attached.
>

I did try to write a generic function very similar to what you wrote, but
discarded it for performance reasons. If we really want to support any
data type I am all for having a generic function, but I would still like
to have one optimized for float8, since that seems to be the most common
use-case (and it is the reason I wrote this in the first place).

Here are some numbers:
First float8:
CREATE TABLE wbtest AS SELECT (random() * 100)::float8 a FROM
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic: ~400ms

Numeric:
CREATE TABLE wbtestn AS SELECT (random() * 100)::numeric a FROM
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms

The difference between my generic and Tom's generic is that Tom's is
slowed down by the deconstruct_array call.

>> 2. Still I strongly afraid about used searching method - there is not
>> possible to check a validity of input. Did you check how much linear
>> searching is slower - you spoke, so you have a real data and real use case?
>
> Well, if we check the input then we'll be doing O(N) comparisons instead
> of O(log N). That could be a serious cost penalty for large arrays and
> expensive comparison functions (such as strcoll()). I think it's probably
> sufficient to have a clear warning in the docs.
>

With a re-sort to validate the input, the performance is worse than a
bunch of CASE WHEN expressions :(

>
> Also, a documentation quibble: in Petr's patch the documentation and
> comments refer to the thresholds as "right bounds", which I didn't care
> for and replaced with "upper bounds". However, looking at it again,
> I wonder if we would not be better off describing them as "lower bounds".
> They are exclusive bounds if seen as upper bounds, and inclusive if seen
> as lower bounds, and on the whole I think the latter viewpoint might be
> less confusing. Thoughts?

Upper bounds is probably a better name than right bounds, I agree with
that. In any case each threshold is the upper bound of a bucket (that's
why there is one more bucket, into which everything bigger than the max
threshold goes).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
