Re: Fix for seg picksplit function

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fix for seg picksplit function
Date: 2010-11-16 11:07:58
Message-ID: AANLkTikbEpKcury9-YeudKBVs71iQWZsv2-GCfAtC7kx@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 16, 2010 at 3:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> But on a broader note, I'm not very certain the sorting algorithm is
> sensible. For example, suppose you have 10 segments that are exactly
> '0' and 20 segments that are exactly '1'. Maybe I'm misunderstanding,
> but it seems like this will result in a 15/15 split when we almost
> certainly want a 10/20 split. I think there will be problems in more
> complex cases as well. The documentation says about the less-than and
> greater-than operators that "These operators do not make a lot of
> sense for any practical purpose but sorting."

In order to illustrate a real problem we should think about
gist behavior with great enough amount of data. For example, I tried to
extrapolate this case to 100000 of segs where 40% are (0,1) segs and 60% are
(1,2) segs. And this case doesn't seem a problem for me.
Here are the details.

test=# insert into seg_test (select case when random()<0.4 then '0..1'::seg
else '1..2'::seg end from generate_series(1,100000));
INSERT 0 100000
Time: 7543,941 ms

test=# create index seg_test_idx on seg_test using gist(a);
CREATE INDEX
Time: 17839,450 ms

test=# explain (analyze, buffers) select count(*) from seg_test where a @>
'1.5'::seg;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=344.84..344.85 rows=1 width=0) (actual
time=186.192..186.193 rows=1 loops=1)
Buffers: shared hit=887
-> Bitmap Heap Scan on seg_test (cost=5.05..344.59 rows=100 width=0)
(actual time=16.066..102.586 rows=60206 l
oops=1)
Recheck Cond: (a @> '1.5'::seg)
Buffers: shared hit=887
-> Bitmap Index Scan on seg_test_idx (cost=0.00..5.03 rows=100
width=0) (actual time=15.948..15.948 rows
=60206 loops=1)
Index Cond: (a @> '1.5'::seg)
Buffers: shared hit=396
Total runtime: 186.306 ms
(9 rows)

test=# explain (analyze, buffers) select count(*) from seg_test where a @>
'0.5'::seg;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=344.84..344.85 rows=1 width=0) (actual
time=144.293..144.295 rows=1 loops=1)
Buffers: shared hit=754
-> Bitmap Heap Scan on seg_test (cost=5.05..344.59 rows=100 width=0)
(actual time=28.926..87.625 rows=39794 lo
ops=1)
Recheck Cond: (a @> '0.5'::seg)
Buffers: shared hit=754
-> Bitmap Index Scan on seg_test_idx (cost=0.00..5.03 rows=100
width=0) (actual time=26.969..26.969 rows
=39794 loops=1)
Index Cond: (a @> '0.5'::seg)
Buffers: shared hit=263
Total runtime: 144.374 ms
(9 rows)

test=# select pg_relpages('seg_test_idx');
pg_relpages
-------------
656
(1 row)

Total number of pages in index is 656 and number of pages used in scans
is 263+396=659. Only 3 pages overhead.
We can see the distribution of segs in index using gevel.

test=# select a, count(1) from gist_print('seg_test_idx') as t(level int,
valid bool, a seg) group by a;
a | count
--------+-------
0 .. 1 | 40054
0 .. 2 | 2
1 .. 2 | 60599
(3 rows)

test=# select level, a, count(1) from gist_print('seg_test_idx') as t(level
int, valid bool, a seg) group by level,a order by level, a;
level | a | count
-------+--------+-------
1 | 0 .. 1 | 1
1 | 0 .. 2 | 1
1 | 1 .. 2 | 1
2 | 0 .. 1 | 259
2 | 0 .. 2 | 1
2 | 1 .. 2 | 392
3 | 0 .. 1 | 39794
3 | 1 .. 2 | 60206
(8 rows)

We have only 2 segs '0..2' (one on the first level and one on the second)
and all other segs are '0..1' and '1..2'. I think there will be more
problems when we'll have many of distinct values where each have count value
about half of index page.

----
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2010-11-16 11:39:35 Re: Re: [BUGS] BUG #5650: Postgres service showing as stopped when in fact it is running
Previous Message Dave Page 2010-11-16 10:25:02 Re: Isn't HANDLE 64 bits on Win64?