PoC: Partial sort

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PoC: Partial sort
Date: 2013-12-14 09:59:02
Message-ID: CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hackers!

Currently when we need to get ordered result from table we have to choose
one of two approaches: get results from index in exact order we need or do
sort of tuples. However, it could be useful to mix both methods: get
results from index in order which partially meets our requirements and do
rest of work from heap.

Two attached patches are proof of concept for this approach.

*partial-sort-1.patch*

This patch allows to use index for order-by if order-by clause and index
has non-empty common prefix. So, index gives right ordering for first n
order-by columns. In order to provide right order for rest m columns, sort
node is inserted. This sort node sorts groups of tuples where values of
first n order-by columns are equal.

See an example.

create table test as (select id, (random()*10000)::int as v1, random() as
v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);

We've index by v1 column, but we can get results ordered by v1, v2.

postgres=# select * from test order by v1, v2 limit 10;
id | v1 | v2
--------+----+--------------------
390371 | 0 | 0.0284479795955122
674617 | 0 | 0.0322008323855698
881905 | 0 | 0.042586590629071
972877 | 0 | 0.0531588457524776
364903 | 0 | 0.0594307743012905
82333 | 0 | 0.0666455538012087
266488 | 0 | 0.072808934841305
892215 | 0 | 0.0744258034974337
13805 | 0 | 0.0794667331501842
338435 | 0 | 0.171817752998322
(10 rows)

And it's fast using following plan.

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

For sure, this approach is effective only when first n order-by columns we
selected provides enough count of unique values (so, sorted groups are
small). Patch is only PoC because it doesn't contains any try to estimate
right cost of using partial sort.

*partial-knn-1.patch*

KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
id | ?column?
--------+----------------------
755611 | 0.000405855808916853
807562 | 0.000464123777564343
437778 | 0.000738524708741959
947860 | 0.00076250998760724
389843 | 0.000886362723569568
17586 | 0.000981960100555216
411329 | 0.00145338112316853
894191 | 0.00149399559703506
391907 | 0.0016647896049741
235381 | 0.00167554614889509
(10 rows)

It's fast using just index scan.

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)

This patch is also only PoC because of following:
1) It's probably wrong at all to get heap tuple from index scan node. This
work should be done from another node.
2) Assumption that order-by operator returns float8 comparable with GiST
distance method result in general case is wrong.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
partial-sort-1.patch application/octet-stream 21.1 KB
partial-knn-1.patch application/octet-stream 25.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-12-14 12:05:33 Re: pgsql: Fix a couple of bugs in MultiXactId freezing
Previous Message Thomas Munro 2013-12-14 09:46:28 Re: Compression of tables