Re: Review Report: propose to include 3 new functions into intarray and intagg

From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: dmitry(at)koterov(dot)ru, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Review Report: propose to include 3 new functions into intarray and intagg
Date: 2008-09-15 12:38:40
Message-ID: 48CE5750.6070309@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

sorry for not having completed this review, yet. As you are obviously
looking at the patch as well, I'll try to quickly write down my points
so far.

Trying to compile the intarray module, I now receive an error:

error: ‘INT4OID’ undeclared (first use in this function)

That can be solved by including "catalog/pg_type.h" from
contrib/intarr/_int_op.c.

The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the
top of the file, where all others are.

Some lines are longer than 80 columns and again comments are a bit
sparse or even useless (no "additional things", please).

Heikki Linnakangas wrote:
> I find it a bit unfriendly to have a function that depends on sorted
> input, but doesn't check it. But that's probably not a good enough
> reason to reject an otherwise simple and useful function. Also, we
> already have uniq, which doesn't strictly speaking require sorted input,
> but it does if you want to eliminate all duplicates from the array.

I think it's a performance optimization which is absolutely required in
some cases. Some time ago I've also had to rip out the sorting step from
certain intarray module functions to save processing time.

One option already mentioned somewhere would be saving a 'sorted'
property for the array. Then again, I think such a thing would certainly
have to be done globally, for all kinds of arrays.

> _int_group_count_sort seems a bit special purpose. Why does it bother to
> sort the output? That's wasted time if you don't need sorted output, or
> if you want the array sorted by the integer value instead of frequency.
> If you want sorted output, you can just sort it afterwards.

Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database
itself should be used for such a thing. However, that means turning an
array into a set of rows...

> Also, it's requiring sorted input for a small performance gain, but
> there's a lot more precedence in the existing intarray functions to not
> require sorted input, but to sort the input instead (union, intersect,
> same, overlap).

..and exactly these are the functions I had to wrap again to strip the
sorting step, due to poor performance for known-sorted arrays.

> I realize that the current implementation is faster for the use case
> where the input is sorted, and output needs to be sorted, but if we go
> down that path we'll soon have dozens of different variants of various
> functions, with different ordering requirements of inputs and outputs.

Agreed.

However, given the OP is using that in production, there seems to be a
use case for the optimization, where we have none for the same function
without it.

> So, I'd suggest changing _int_group_count_sort so that it doesn't
> require sorted input, and doesn't sort the output. The binary search
> function looks good to me (I think I'd prefer naming it bsearch(),
> though, though I can see that it was named bidx in reference to the
> existing idx function). Also, as Markus pointed out, the SGML docs need
> to be updated.

As is, it should probably also carry the '_int' prefix, because it's not
a general purpose array function. So propose to name it '_int_bsearch'.

Overall I think these functions are overly specialized and should be
replaced be more general counterparts in core. However, until we have
that, it's hard to refuse such a thing for contrib.

By that reasoning, however, the intarray would have to provide methods
for sorted as well as asorted input arrays as well.

I'm closing my part of reviewing of this patch now. Dmitry, how do you
want to proceed with these patches? Do you have time for some cleaning
up and writing documentation?

Regards

Markus Wanner

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2008-09-15 12:40:19 Re: Transaction Snapshots and Hot Standby
Previous Message Gregory Stark 2008-09-15 12:13:04 Re: Transaction Snapshots and Hot Standby