Re: Fix for seg picksplit function

From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fix for seg picksplit function
Date: 2010-11-30 19:57:39
Message-ID: 4CF55733.7000403@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2010-11-16 09:57, Alexander Korotkov wrote:
> On Tue, Nov 16, 2010 at 3:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com
> <mailto:robertmhaas(at)gmail(dot)com>> wrote:
>
> The loop that begins here:
>
> for (i = 0; i < maxoff; i++)
> {
> /* First half of segs goes to the left datum. */
> if (i < seed_2)
>
> ...looks like it should perhaps be broken into two separate loops.
> That might also help tweak the logic in a way that eliminates this:
>
> seg.c: In function ‘gseg_picksplit’:
> seg.c:327: warning: ‘datum_r’ may be used uninitialized in this
> function
> seg.c:326: warning: ‘datum_l’ may be used uninitialized in this
> function
>
> I restored original version of that loop.
>
> But on a broader note, I'm not very certain the sorting algorithm is
> sensible. For example, suppose you have 10 segments that are exactly
> '0' and 20 segments that are exactly '1'. Maybe I'm misunderstanding,
> but it seems like this will result in a 15/15 split when we almost
> certainly want a 10/20 split. I think there will be problems in more
> complex cases as well. The documentation says about the less-than and
> greater-than operators that "These operators do not make a lot of
> sense for any practical purpose but sorting."
>
> I think almost any split algorithm has corner cases when it's results
> don't look very good. I think the way to understand significance of
> these corner cases for real life is to perform sufficient testing
> on datasets which is close to real life. I'm not feeling power to
> propose enough of test datasets and estimate their significance for
> real life cases, and I need help in this field.
I think it is time to mark this patch ready for committer:

The unintuitive result thus far is that sorting outperforms the R-tree
bounding boxes style index, as Alexander has demonstrated with several
different distributions on 20-11 (uniform, natural (is that a bell
curve?), many distinct values)

My personal opinion is that I like the single loop for walking over the
sort array (aka gbt_num_picksplit) more than the two different ones, but
I'm in the minority here.

Two remarks on this patch also apply to other picksplit functions
currently around:
1) the *first = *last = FirstOffsetNumber assignment, as noted by
Alvaro, is not necessary for anymore, and Oleg confirmed this is true
since PostgreSQL > 7.x. 2) loops over something other than the
entryvector better not use FirstOffsetNumber, OffsetNumberNext, as
indicated by Tom.

If this patch is committed, it might be an idea to change the other
occurences as well.

regards,
Yeb Havinga

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2010-11-30 20:05:07 Re: Another proposal for table synonyms
Previous Message Tom Lane 2010-11-30 19:50:46 KNNGIST next step: adjusting indexAM API