| From: | Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl> | 
|---|---|
| To: | Martijn van Oosterhout <kleptog(at)svana(dot)org> | 
| Cc: | pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: why is gist index taking so much space on the disc | 
| Date: | 2005-11-21 19:14:44 | 
| Message-ID: | FCB2D3B8-9CCD-404E-8728-45CDE4DBE93F@pointblue.com.pl | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On 2005-11-21, at 19:32, Martijn van Oosterhout wrote:
> On Mon, Nov 21, 2005 at 04:58:25PM +0100, Grzegorz Jaskiewicz wrote:
>> my conquers with Gist index for custom type are nearly finished. It
>> is working as it is now, but there are few problems here and there.
>> One of em, being amount of disc space index it self takes. The type
>> stucture it self takes 160bytes. Adding 100.000 rows into table -
>> CREATE TABLE blah (a serial, b customType);
>
> Let's see, 160bytes means you'll get aboud 50 keys per page. So you
> would expect 2000 leaf page, 40 level 1 pages. This should be less  
> than
> 20-30MB
>
yep;
> You mean you sometimes put the same elements in the two halves? You
> shouldn't do that. The whole point is that the search will descend any
> node that matches consistant, but any single key should only appear
> once in each index.
>
> picksplit should *split* the set, not return two sets about the same
> size as you started...
Nope, I mean that 'masks' created to match either 'half' sometimes  
match elements in the other one.
This shouldn't be a big deal, just one level to go down on query to  
much more specific result set.
I have fixed that with, somewhat hack.
Here's some pseudo code
sort_data(input);
find_split_point(input);
mask1 = generate_two_masks(input[0]);
mask2 = generate_two_masks(input[1]);
foreach(input) {
	bool a = matches1(input);
	bool b = matches2(input);
	if ( a && b ) {
		if ( left_index == 0 ) {
			left[left_index++] = input;	
		}
		else {
			if ( right_index == 0 ) {
				right[right_index++] = input;
				continue;
			}
/* this part is new code, and helped a lot, now gist index takes much  
less space
and is much faster, because of lower I/O consumption*/
			if ( loop_index % 2 ) {
				right[right_index++] = input;
			}
			else {
				left[left_index++] = input;
			}
		}
	}
	else {
		if ( a) left[left_index++] = input;
		if ( b) right[right_index++] = input;
	}
}
mask1 = generate(left );
mask2 = generate(right);
return (left, right, blah, blih, others);
Ok, so the part with i%2 helped a lot, it distributes elements  
matching both masks evenly.
Thanks guys.
I will play with k-means, and see if they will work better with no  
hacks. Either way, I have to have some code that will handle "matches  
both" case.
Thanks again.
-- 
GJ
"If we knew what we were doing, it wouldn't be called Research, would  
it?" - AE
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Marc Munro | 2005-11-21 19:19:59 | Re: [pgsql-hackers] Daily digest v1.5568 (24 messages) | 
| Previous Message | Kevin McArthur | 2005-11-21 19:08:36 | Re: why is gist index taking so much space on the disc |