Re: POC, WIP: OR-clause support for indexes

From: "Finnerty, Jim" <jfinnert(at)amazon(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
Cc: Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "teodor(at)sigaev(dot)ru" <teodor(at)sigaev(dot)ru>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-08-01 20:11:07
Message-ID: 4002251A-F36D-4148-A258-548C982397E2@amazon.com
Lists: pgsql-hackers

Peter, I'm very glad to hear that you're researching this!

Will this include skip-scan optimizations for OR or IN predicates, or for cases where the number of distinct values in the leading non-constant index column(s) is sufficiently small? For example, suppose there is an ORDER BY b, and the WHERE clause contains the predicates (a = 1 AND b = 5) OR (c > 12 AND b BETWEEN 100 AND 200). Then a single index scan on an index with leading column b could visit b = 5, then the range b = 100 to 200, and deliver the rows in the order requested. Or, if the predicate is (a = 1 AND b = 5) OR (c LIKE 'XYZ' AND b < 12), then you can scan just b < 12. Or, if the index is defined on (a, b), you know that b = 100, and there are only 4 distinct values of column a, then you could skip to each distinct value of a where b = 100, and so on.
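
To make that first scenario concrete, here is a rough sketch (the table,
column types, and index definition are just my assumptions, not anything
from the patch):

create table t (a int, b int, c int);
create index t_b_idx on t (b);

select * from t
where (a = 1 and b = 5) or (c > 12 and b between 100 and 200)
order by b;

A single scan of t_b_idx could visit b = 5 and then the range b = 100..200,
and still return the rows in the requested order.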

If you have an ORDER BY clause and a lower and upper bound on the first column of the ORDER BY list, you have the potential to reduce search effort versus a full index scan, even when those bounds need to be derived from a complex predicate.

Of course, if you have an IN list you can either skip to the distinct values listed or scan the entire index, depending on estimated cost.
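
Again, just a hypothetical sketch of that last point, reusing the made-up
table above but with an index on (a, b): a query like

select * from t where a in (3, 17, 42) order by a, b;

could either skip to a = 3, a = 17, and a = 42 in turn, or scan the entire
index, whichever the planner estimates to be cheaper.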

/Jim F

On 8/1/23, 3:43 PM, "Peter Geoghegan" <pg(at)bowt(dot)ie> wrote:

On Mon, Jul 31, 2023 at 9:38 AM Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru> wrote:
> I noticed only one thing there: when we have unsorted array values in
> a SAOP, the query takes longer than
> when it has a sorted array. I'll double-check it just in case and write
> about the results later.

I would expect the B-Tree preprocessing by _bt_preprocess_array_keys()
to be very slightly faster when the query is written with presorted,
duplicate-free constants. Sorting is faster when you don't really have
to sort. However, I would not expect the effect to be significant
enough to matter, except perhaps in very extreme cases.
Although...some of the cases you care about are very extreme cases.
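
For instance (the table and values here are placeholders of mine), these
two spellings ask for the same rows, but the first hands
_bt_preprocess_array_keys() constants that are already sorted and
duplicate-free, while the second makes it sort and deduplicate first:

select * from some_table where a = any('{1,2,3,4}');
select * from some_table where a = any('{4,2,4,1,3}');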

> I am also testing some experience with multi-column indexes using SAOPs.

Have you thought about a similar transformation for when the row
constructor syntax happens to have been used?

Consider a query like the following, against a table with a composite
index on (a, b):

select * from multi_test where ( a, b ) in (( 1, 1 ), ( 2, 1 ));

This query will get a BitmapOr-based plan that's similar to the plans
that OR-based queries affected by your transformation patch get today,
on HEAD. However, this equivalent spelling has the potential to be
significantly faster:

select * from multi_test where a = any('{1,2}') and b = 1;

(Of course, this is more likely to be true with my nbtree SAOP patch in place.)
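
For anyone who wants to try these queries, a schema consistent with the
examples might look something like this (the exact DDL and test data are
my guesses, not something stated in the thread):

create table multi_test (a int, b int);
create index multi_test_a_b_idx on multi_test (a, b);
insert into multi_test select i % 10, i % 7 from generate_series(1, 100000) i;
analyze multi_test;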

Note that we currently won't use RowCompareExpr in many simple cases
where the row constructor syntax has been used. For example, a query
like this:

select * from multi_test where ( a, b ) = (( 2, 1 ));

This case already involves a transformation that is roughly comparable
to the one you're working on now. We'll remove the RowCompareExpr
during parsing. It'll be as if my example row constructor equality
query was written this way instead:

select * from multi_test where a = 2 and b = 1;

This can be surprisingly important, when combined with other things,
in more realistic examples.

The nbtree code has special knowledge of RowCompareExpr that makes the
rules for comparing index tuples different from those used by other
kinds of index scans. However, due to the RowCompareExpr transformation
process I just described, we don't need to rely on that specialized
nbtree code when the row constructor syntax is used with a simple
equality clause -- which is what makes the normalization process have
real value. If the nbtree code cannot see RowCompareExpr index quals
then it cannot have this problem in the first place. In general it is
useful to "normalize to conjunctive normal form" when it might allow
scan key preprocessing in the nbtree code to come up with a much
faster approach to executing the scan.

It's easier to understand what I mean by showing a simple example. The
nbtree preprocessing code is smart enough to recognize that the
following query doesn't really need to do any work, due to having
quals that it recognizes as contradictory (it can set so->qual_ok to
false for unsatisfiable quals):

select * from multi_test where ( a, b ) = (( 2, 1 )) and a = -1;

However, it is not smart enough to perform the same trick if we change
one small detail with the query:

select * from multi_test where ( a, b ) >= (( 2, 1 )) and a = -1;
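
One way to see the difference for yourself (actual plan output omitted
here, since it depends on the data and settings) is to compare the two
forms with EXPLAIN:

explain (analyze, buffers)
select * from multi_test where ( a, b ) = (( 2, 1 )) and a = -1;

explain (analyze, buffers)
select * from multi_test where ( a, b ) >= (( 2, 1 )) and a = -1;

In the first case nbtree preprocessing can mark the quals unsatisfiable
and return no rows without ever descending the index; in the second it
cannot, at least not today.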

Ideally, the optimizer would canonicalize/normalize everything in a
way that made all of the nbtree preprocessing optimizations work just
as well, without introducing any new special cases. Obviously, there
is no reason why we can't perform the same trick with the second
variant. (Note also that the nbtree preprocessing code can be smart
about redundant quals, not just contradictory quals, so it matters
more than it may appear from this simple, unrealistic example of
mine.)

While these similar RowCompareExpr transformations are at least
somewhat important, that's not really why I bring them up. I am
pointing them out now because I think it might help you to
develop a more complete mental model of these transformations.
Ideally, your initial approach will generalize to other situations
later on. So it's worth considering the relationship between this
existing RowCompareExpr transformation, and the one that you're
working on currently. Plus other, future transformations.

This example might also give you some appreciation of why my SAOP
patch is confused about when we need to do normalization/safety
checks. Some things seem necessary when generating index paths in the
optimizer. Other things seem necessary during preprocessing, in the
nbtree code, at the start of the index scan. Unfortunately, it's not
obvious to me where the right place is to deal with each aspect of
setting up multi-column SAOP index quals. My mental model is very
incomplete.

--
Peter Geoghegan
