Re: [PATCH] kNN for btree

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] kNN for btree
Date: 2017-02-16 13:05:54
Message-ID: CAPpHfdtvTg+y5aAq3C5dF8HTC8P21iHoAE-6ur3Li91wWpDEmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Nikita!

I have assigned as a reviewer for this patchset. I took a fist look to
these patches.
At first, I'd like to notice that it's very cool that you picked up this
work. I frequently hear people complains about lack of this feature.

k | kNN-btree | kNN-GiST | Opt. query | Seq. scan
> | | (btree_gist) | with UNION | with sort
> --------|--------------|--------------|---------------|------------
> 1 | 0.041 4 | 0.079 4 | 0.060 8 | 41.1 1824
> 10 | 0.048 7 | 0.091 9 | 0.097 17 | 41.8 1824
> 100 | 0.107 47 | 0.192 52 | 0.342 104 | 42.3 1824
> 1000 | 0.735 573 | 0.913 650 | 2.970 1160 | 43.5 1824
> 10000 | 5.070 5622 | 6.240 6760 | 36.300 11031 | 54.1 1824
> 100000 | 49.600 51608 | 61.900 64194 | 295.100 94980 | 115.0 1824

These results looks quite expected. KNN-btree uses about half of blocks in
comparison with UNION query, and it's more than twice faster. In
comparison with kNN-GiST there is still some win.

1. Introduce amcanorderbyop() function
>
> This patch transforms existing boolean AM property amcanorderbyop into a
> method
> (function pointer). This is necessary because, unlike GiST, kNN for btree
> supports only a one ordering operator on the first index column and we
> need a
> different pathkey matching logic for btree (there was a corresponding
> comment
> in match_pathkeys_to_index()). GiST-specific logic has been moved from
> match_pathkeys_to_index() to gistcanorderbyop().

I'm not very excited about this design of amcanorderbyop callback.
Introducing new callback from index access method to the planner should
imply quite good flexibility to the future. In this particular signature
of callback I see no potential future use-cases than your implementation
for btree. We could just add amcanorderbyonlyfisrtop property and that
would give us same level of flexibility I think.
With existing index types, we could cover much more orderings that we
currently do. Some of possible cases:
1) "ORDER BY col" for btree_gist, SP-GiST text_ops
2) "ORDER BY col1, col2 <-> const" for btree_gist
3) "ORDER BY col1, col2 <-> const" for btree

I understand that #3 is quite hard task and I don't ask you to implement it
now. But it would be nice if some day we decide to add #3, we wouldn't
have to change IndexAmRoutine definition.

My idea is that we need more general redesign of specifying ordering which
index can produce. Ideally, we should replace amcanorder, amcanbackward
and amcanorderbyop with single callback. Such callback should take a list
of pathkeys and return number of leading pathkeys index could satisfy (with
corresponding information for index scan). I'm not sure that other hackers
would agree with such design, but I'm very convinced that we need something
of this level of extendability. Otherwise we would have to hack our
planner <-> index_access_method interface each time we decide to cover
another index produced ordering.

6. Remove duplicate distance operators from contrib/btree_gist.
>
> References to their own distance operators in btree_gist opclasses are
> replaced with references to the built-in operators and than duplicate
> operators are dropped. But if the user is using somewhere these operators,
> upgrade of btree_gist from 1.3 to 1.4 would fail.

The query in "btree_gist--1.3--1.4.sql" which directly touches system
catalogue to update opfamilies looks too hackery. I think we shouldn't use
such queries until we have no other choice.
In this particular case we can update opfamilies using legal mechanism
"ALTER OPERATOR FAMILY name USING index_method ADD/DROP ... " (note that
operator name could be schema-qualified if needed). This way wouldn't be
that brief, but it is much more correct.

Also this like catch my eyes.

> info->amcanorderbyop = (void (*)()) amroutine->amcanorderbyop;

It's not necessary to use cast here. For instance, we don't use cast
for amcostestimate.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kuntal Ghosh 2017-02-16 13:11:07 Re: Passing query string to workers
Previous Message Thomas Munro 2017-02-16 12:49:59 Re: SERIALIZABLE with parallel query