Re: intarray internals

From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-11 08:25:30
Message-ID: 20060511082529.GA223@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Hi,

First, thanks so much for your reply.

On May 10 04:01, Teodor Sigaev wrote:
> > Again, in g_int_decompress(), I couldn't figure out the functionality of
> > below lines:
>
> gist__int_ops use rangeset compression technique, read about in "THE
> RD-TREE: AN INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein,
> http://www.sai.msu.su/~megera/postgres/gist/papers/rd-tree.ps

Thanks so much for the papers. I read the related section (and will read
whole today or tomorrow).

> * intarray_union.patch.0 - doesn't applied, but make small optimization to
> reduce number non-unique values. I don't believe that one pass through
> array with a lot of ifs will be faster than two pass with simple ifs. Did
> you some tests?

IMHO, the only significant improvement in my proposal about _int_union()
is that this method will visit arrays only once (with extra price of x2
condition checks), while current one will also make a second visit to
arrays to remove duplicates (with small condition checks).

You can be right, maybe it doesn't worth for worrying about. Improvement
(if there's any) will probably be available to see for very long arrays.
(Sorry, no tests for this proposal.)

> * intarray_same.patch.0 - move SORT as you suggest, but don't touch
> algorithm.
> 1) if (A[0] == B[0] && A[1] == B[1] && ...)
>
> 2) if (A[0] == B[0] && A[ N] == B[ N] &&
> A[1] == B[1] && A[N-1] == B[N-1] &&
> ...)
>
> Why are you sure that second a much faster? Did you make tests? Number of
> comparisons is the same...

Yep, both algorithms have O(n) comparisions in their worst cases. But
for general purposes, AFAICS, second one will perform better. For
instance consider below examples:

[Best case for 2nd algo.]
Input : 1, 2, 3, ..., 6, 7, *9
1st algo.: O(n)
2nd algo.: O(1)

[Worst case for 2nd algo.]
Input : 1, 2, 3, 4, *4, 6, 7, 8, 9
1st algo.: O(n/2)
2nd algo.: O(n)

But as you can see, because of our arrays are sorted, any missing (or
additional) element in the target array will produce a padding in the
end of the array --- assuming that arrays generally don't hold
duplicate values. Therefore, making comparisons for the tail elements
will perform better beucause of the unmatched values caused by padding.

Hope I managed to explain what I try to mean. Actually, IIRC, I saw this
method (both hacks for small sized arrays and comparisons for the tail
elements of a sorted array) in another FOSS project's source code ---
probably PHP, but I'm not sure.

For about testing, if you'd supply suitable inputs there occurs a quite
much performance improve.

> * intarray_sort.patch.0 - doesn't applied. isort() is very often called for
> already sorted and unique arrays (which comes from index), so it should be
> fast as possible for sorted arrays.

Uh, sorry. I missed that point.

> As I remember ordered array is a worst
> case for qsort(). May be, it will be better choice to use mergesort.

I'll investigate alternative methods to sort already sorted arrays.

Regards.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Sim Zacks 2006-05-11 09:27:26 Re: understanding explain data
Previous Message Anastasios Hatzis 2006-05-11 08:16:41 Debugging SQL queries

Browse pgsql-hackers by date

  From Date Subject
Next Message Martijn van Oosterhout 2006-05-11 08:30:25 Re: [HACKERS] Big IN() clauses etc : feature proposal
Previous Message Zeugswetter Andreas DCP SD 2006-05-11 07:55:15 Re: [HACKERS] Big IN() clauses etc : feature proposal