[PATCH] kNN for btree

From: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
To: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] kNN for btree
Date: 2017-01-18 01:42:01
Message-ID: ce35e97b-cf34-3f5d-6b99-2c25bae49999@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi hackers,

I'd like to present a series of patches that implements k-Nearest
Neighbors (kNN)
search forbtree, which can be usedto speed up ORDER BY distance queries
like this:
SELECT * FROM eventsORDER BY date <-> '2000-01-01'::date ASC LIMIT 100;
Now only GiST supports kNN, but kNN on btree can be emulated using
contrib/btree_gist.

Scanning algorithm
==================

Algorithm is very simple: we use bidirectional B-tree index scan
starting at the point
from which we measure the distance(target point).At each step, we advance
this scan in the directionthat has the nearest point. Butwhen the
target point
does not fall into the scannedrange, we don't even need to
useabidirectional
scan here --- wecan use ordinaryunidirectional scaninthe right direction.

Performance results
===================

Test database istaken from original kNN-GiST presentation (PGCon2010).

Test query

SELECT * FROM events ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;

can be optimizedto the next rather complicated UNION form, whichnolonger
requireskNN:

WITH
t1 AS (SELECT * FROM events WHERE date >= '1957-10-04'::date ORDER BY
date ASC LIMIT k),
t2 AS (SELECT * FROM events WHERE date < '1957-10-04'::date ORDER BY
date DESC LIMIT k),
t AS (SELECT * FROM t1 UNION SELECT * FROM t2)
SELECT * FROM t ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;

In each cell of this table shown query execution time in milliseconds and
the number of accessed blocks:

k | kNN-btree | kNN-GiST| Opt. query | Seq.scan
| | (btree_gist) | withUNION | with sort
-------|--------------|--------------|---------------|------------
1 | 0.0414| 0.0794| 0.0608 |41.11824
10 | 0.0487 | 0.0919 | 0.09717 |41.81824
100 | 0.10747 | 0.192 52| 0.342104 |42.31824
1000 | 0.735573 | 0.913 650 | 2.9701160 |43.51824
10000 | 5.070 5622| 6.240 6760| 36.300 11031 | 54.1 1824
100000 | 49.600 51608| 61.900 64194 | 295.100 94980| 115.0 1824

As you can see, kNN-btree can be two times faster than kNN-GiST(btree_gist)
when k < 1000,but the number of blocks readis roughly the same.

Implementation details
======================

A brief description is given below for each of the patches:

1. Introduce amcanorderbyop() function

This patch transformsexisting boolean AMpropertyamcanorderbyop into a
method
(function pointer).This is necessarybecause, unlike GiST,kNN for
btreesupports
only a one ordering operator onthe first index column and we need a
different pathkey
matching logicfor btree (there was acorresponding comment
inmatch_pathkeys_to_index()).
GiST-specific logic has been moved from match_pathkeys_to_index() to
gistcanorderbyop().

2. Extract substructure BTScanState from BTScanOpaque

This refactoringis necessary for bidirectional kNN-scanimplementation. Now,
BTScanOpaque'ssubstructure BTScanState containing only thefields related
toscanpositionis passed to some functions where thewhole BTScanOpaque
waspassedpreviously.

3. Extract get_index_column_opclass(),
get_opclass_opfamily_and_input_type().

Extracted two simple common functions usedingistproperty() and
btproperty() (see the next patch).

4. Add kNN supportto btree

* Added additional optional BTScanState to BTScanOpaque for
bidirectional kNN scan.
* Implemented bidirectional kNN scan.
* Implemented logic for selecting kNN strategy
* Implemented btcanorderbyop(), updated btproperty() and btvalidate()

B-tree user interface functions have not been altered because ordering
operators
are used directly.

5. Add distance operators for sometypes

These operators for integer, float, date, time, timestamp, interval,
cash and oidtypes
havebeencopied fromcontrib/btree_gistand added to the existing btree
opclasses
as ordering operators. Their btree_gist duplicates areremoved in the
next patch.

6. Remove duplicate distance operators from contrib/btree_gist.

References to their own distance operators in btree_gist opclassesare
replaced
with references to the built-in operatorsand thanduplicate operators are
dropped.
But if the user is using somewhere these operators, upgrade of btree_gist
from 1.3 to 1.4 should fail.

7. Add regression tests for btree kNN.

Tests were added only after the built-in distance operators were added.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-introduce-amcanorderby-function-v01.patch text/x-patch 13.6 KB
0002-extract-structure-BTScanState-v01.patch text/x-patch 37.4 KB
0003-extract-get_index_column_opclass-and-get_opclass_opfamily_and_input_type-v01.patch text/x-patch 5.1 KB
0004-add-kNN-support-to-btree-v01.patch text/x-patch 38.0 KB
0005-add-distance-operators-v01.patch text/x-patch 20.9 KB
0006-remove-distance-operators-from-btree_gist-v01.patch text/x-patch 97.9 KB
0007-add-regression-tests-for-kNN-btree-v01.patch text/x-patch 15.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-01-18 01:54:36 Re: increasing the default WAL segment size
Previous Message Michael Paquier 2017-01-18 01:37:34 Re: Re: Clarifying "server starting" messaging in pg_ctl start without --wait