From: | Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl> |
---|---|
To: | Martijn van Oosterhout <kleptog(at)svana(dot)org> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: why is gist index taking so much space on the disc |
Date: | 2005-11-21 19:14:44 |
Message-ID: | FCB2D3B8-9CCD-404E-8728-45CDE4DBE93F@pointblue.com.pl |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2005-11-21, at 19:32, Martijn van Oosterhout wrote:
> On Mon, Nov 21, 2005 at 04:58:25PM +0100, Grzegorz Jaskiewicz wrote:
>> my conquers with Gist index for custom type are nearly finished. It
>> is working as it is now, but there are few problems here and there.
>> One of em, being amount of disc space index it self takes. The type
>> stucture it self takes 160bytes. Adding 100.000 rows into table -
>> CREATE TABLE blah (a serial, b customType);
>
> Let's see, 160bytes means you'll get aboud 50 keys per page. So you
> would expect 2000 leaf page, 40 level 1 pages. This should be less
> than
> 20-30MB
>
yep;
> You mean you sometimes put the same elements in the two halves? You
> shouldn't do that. The whole point is that the search will descend any
> node that matches consistant, but any single key should only appear
> once in each index.
>
> picksplit should *split* the set, not return two sets about the same
> size as you started...
Nope, I mean that 'masks' created to match either 'half' sometimes
match elements in the other one.
This shouldn't be a big deal, just one level to go down on query to
much more specific result set.
I have fixed that with, somewhat hack.
Here's some pseudo code
sort_data(input);
find_split_point(input);
mask1 = generate_two_masks(input[0]);
mask2 = generate_two_masks(input[1]);
foreach(input) {
bool a = matches1(input);
bool b = matches2(input);
if ( a && b ) {
if ( left_index == 0 ) {
left[left_index++] = input;
}
else {
if ( right_index == 0 ) {
right[right_index++] = input;
continue;
}
/* this part is new code, and helped a lot, now gist index takes much
less space
and is much faster, because of lower I/O consumption*/
if ( loop_index % 2 ) {
right[right_index++] = input;
}
else {
left[left_index++] = input;
}
}
}
else {
if ( a) left[left_index++] = input;
if ( b) right[right_index++] = input;
}
}
mask1 = generate(left );
mask2 = generate(right);
return (left, right, blah, blih, others);
Ok, so the part with i%2 helped a lot, it distributes elements
matching both masks evenly.
Thanks guys.
I will play with k-means, and see if they will work better with no
hacks. Either way, I have to have some code that will handle "matches
both" case.
Thanks again.
--
GJ
"If we knew what we were doing, it wouldn't be called Research, would
it?" - AE
From | Date | Subject | |
---|---|---|---|
Next Message | Marc Munro | 2005-11-21 19:19:59 | Re: [pgsql-hackers] Daily digest v1.5568 (24 messages) |
Previous Message | Kevin McArthur | 2005-11-21 19:08:36 | Re: why is gist index taking so much space on the disc |