Re: Built-in binning functions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-29 18:20:44
Message-ID: CAFj8pRBOuoD3ntt-M_sFVwTsRes3FDwTiDiTnu8-MYBBGX32tg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

I am looking to your last patch and I have a few questions, notes

1. I am thinking so reduction to only numeric types is not necessary -
although we can live without it - but there are lot of non numeric
categories: chars, date, ...

2. Still I strongly afraid about used searching method - there is not
possible to check a validity of input. Did you check how much linear
searching is slower - you spoke, so you have a real data and real use case?
Used methods can returns wrong result without any warning, what is
acceptable for extensions, but I am not sure, for core feature.

3. I am thinking about name - I don't think so varwidth_bucket is correct
-- in relation to name "width_bucket" ... what about "range_bucket" or
"scope_bucket" ?

last patch is very simple, there are no new compilation or regress tests
issues.

Regards

Pavel

2014-08-25 16:15 GMT+02:00 Petr Jelinek <petr(at)2ndquadrant(dot)com>:

> Hi,
>
> I finally had some time to get back to this.
>
> I attached version3 of the patch which "fixes" Tom's complaint about int8
> version by removing the int8 version as it does not seem necessary (the
> float8 can handle integers just fine).
>
> This patch now basically has just one optimized function for float8 and
> one for numeric datatypes, just like width_bucket.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Also, I'm not convinced by this business of throwing an error for a
>> NULL array element. Per spec, null arguments to width_bucket()
>> produce a null result, not an error --- shouldn't this flavor act
>> similarly? In any case I think the test needs to use
>> array_contains_nulls() not just ARR_HASNULL.
>>
>
> I really think there should be difference between NULL array and NULL
> inside array and that NULL inside the array is wrong input. I changed the
> check to array_contains_nulls() though.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Lastly, the spec defines behaviors for width_bucket that allow either
>> ascending or descending buckets. We could possibly do something
>> similar
>>
>
> I am not sure it's worth it here as we require input to be sorted anyway.
> It might be worthwhile if we decided to do this as an aggregate (since
> there input would not be required to be presorted) instead of function but
> I am not sure if that's desirable either as that would limit usability to
> only the single main use-case (group by and count()).
>
>
>
> On 20/07/14 11:01, Simon Riggs wrote:
>
>> On 16 July 2014 20:35, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>
>>>
>>> On 08/07/14 02:14, Tom Lane wrote:
>>>>
>>>>>
>>>>> I didn't see any discussion of the naming question in this thread.
>>>>> I'd like to propose that it should be just "width_bucket()"; we can
>>>>> easily determine which function is meant, considering that the
>>>>> SQL-spec variants don't take arrays and don't even have the same
>>>>> number of actual arguments.
>>>>>
>>>>
>>>> I did mention in submission that the names are up for discussion, I am
>>>> all
>>>> for naming it just width_bucket.
>>>>
>>>
>>> I had this idea too - but I am not sure if it is good idea. A distance
>>> between ANSI SQL with_bucket and our enhancing is larger than in our
>>> implementation of "median" for example.
>>>
>>> I can live with both names, but current name I prefer.
>>>
>>
>>
>> So I suggest that we use the more generic function name bin(), with a
>> second choice of discretize() (though that seems annoyingly easy to
>> spell incorrectly)
>>
>>
> I really don't think bin() is good choice here, bucket has same meaning in
> this context and bin is often used as shorthand for binary which would be
> confusing here.
>
> Anyway I currently left the name as it is, I would not be against using
> width_bucket() or even just bucket(), not sure about the discretize()
> option.
>
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
>
> PostgreSQL Development, 24x7 Support, Training & Services
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2014-08-29 18:23:27 Re: 9.5: Better memory accounting, towards memory-bounded HashAgg
Previous Message Hannu Krosing 2014-08-29 18:12:16 Re: On partitioning