Re: Built-in binning functions

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-30 17:24:30
Message-ID: 6053.1409419470@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
> 1. I am thinking so reduction to only numeric types is not necessary -
> although we can live without it - but there are lot of non numeric
> categories: chars, date, ...

I wasn't terribly happy about that either. I still think we should
reduce this to a single polymorphic function, as in the attached.

> 2. Still I strongly afraid about used searching method - there is not
> possible to check a validity of input. Did you check how much linear
> searching is slower - you spoke, so you have a real data and real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N). That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()). I think it's probably
sufficient to have a clear warning in the docs.

> 3. I am thinking about name - I don't think so varwidth_bucket is correct
> -- in relation to name "width_bucket" ... what about "range_bucket" or
> "scope_bucket" ?

Neither of those seem like improvements from here. I agree with the
objections to bin() as well. bucket() might be all right but it still
seems a bit too generic. width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best
of the suggestions so far.

It occurs to me that there would be an advantage to using some other
name besides width_bucket: we could then mark the function as variadic,
which might be a notational advantage in some usages. (We can't do
that if we call it width_bucket because the four-parameter case would
be ambiguous with the existing functions.) I'm not sure that this is
important enough to justify changing the name though, especially if
we can't come up with a good name. Also, doing that would put a very
large premium on picking a non-generic name, else we'd risk creating
ambiguities against user-defined functions.

Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds". However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing. Thoughts?

Proposed patch with 1 polymorphic function attached.

regards, tom lane

Attachment Content-Type Size
binning-fns-v4.patch text/x-diff 14.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-08-30 17:50:40 Re: postgresql latency & bgwriter not doing its job
Previous Message Andres Freund 2014-08-30 17:19:59 Re: postgresql latency & bgwriter not doing its job