Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
Date: 2019-12-30 22:40:31
Message-ID: CAH2-WzmCU4hmuN9RT=5zjWg=FwGf5H2CiSp+Vb4M5Hm-5g5OzA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 30, 2019 at 9:45 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > For example, float and numeric types are "never bitwise equal", while array,
> > text, and other container types are "maybe bitwise equal". An array of
> > integers or text with C collation can be treated as bitwise equal attributes,
> > and it would be too harsh to restrict them from deduplication.

We might as well support container types (like array) in the first
Postgres version that has nbtree deduplication, I suppose. Even still,
I don't think that it actually matters much to users. B-Tree indexes
on arrays are probably very rare. Note that I don't consider text to
be a container type here -- obviously btree/text_ops is a very
important opclass for the deduplication feature. It may be the most
important opclass overall.

Recursively invoking a support function for the "contained" data type
in the btree/array_ops support function seems like it might be messy.
Not sure about that, though.
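
To make the recursion concern concrete, here is a minimal sketch written as a toy Python model rather than as real nbtree C code. Every name in it (the SUPPORT registry, the *_equal_is_bitwise functions, the collation argument passed as a string) is hypothetical and does not exist in PostgreSQL. It also assumes, only for illustration, that text is safe under the C collation and nowhere else.

```python
from typing import Callable, Dict

# Hypothetical signature: given a collation name, report whether the
# opclass's notion of equality implies bitwise-identical values.
EqualIsBitwiseFn = Callable[[str], bool]


def int4_equal_is_bitwise(collation: str) -> bool:
    # Equal integers always have identical bytes; collation is irrelevant.
    return True


def numeric_equal_is_bitwise(collation: str) -> bool:
    # Values such as 1.0 and 1.00 compare equal but differ in display scale.
    return False


def text_equal_is_bitwise(collation: str) -> bool:
    # Assumption for this sketch only: just the C collation is treated as safe.
    return collation == "C"


# Hypothetical registry mapping an element type to its support function.
SUPPORT: Dict[str, EqualIsBitwiseFn] = {
    "int4": int4_equal_is_bitwise,
    "numeric": numeric_equal_is_bitwise,
    "text": text_equal_is_bitwise,
}


def array_equal_is_bitwise(element_type: str, collation: str) -> bool:
    """Container support function: defer to the element type's function."""
    element_fn = SUPPORT.get(element_type)
    if element_fn is None:
        # No support function for the element type: assume it is unsafe.
        return False
    return element_fn(collation)
```

The container function has no opinion of its own here; it only forwards the question (and the collation) to the "contained" type, which is where the messiness would come from in a real implementation.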

> > What bothers me is that this option will unlikely be helpful on its own
> > and we should also provide some kind of recheck function along with opclass,
> > which complicates this idea even further and doesn't seem very clear.
>
> It seems like the simplest thing might be to forget about the 'char'
> column and just have a support function which can be used to assess
> whether a given opclass's notion of equality is bitwise.

I like the idea of relying only on a support function.

This approach makes collations a problem that the opclass author has
to deal with directly, as is the case within a SortSupport support
function. It also seems like it would make life easier for third-party
data types that want to make use of these optimizations (if in fact
there are any).

I also see little downside to this approach. The extra cycles
shouldn't be noticeable. As far as the B-Tree deduplication logic is
concerned, the final boolean value (is deduplication safe?) comes from
the index metapage -- we pass that down through an insertion scankey.
We only need to determine whether or not the optimization is safe at
CREATE INDEX time. (Actually, I don't want to commit to the idea that
nbtree should only call this support function at CREATE INDEX time
right now. I'm sure that it will hardly ever need to be called,
though.)
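
The flow described above can be sketched as a toy Python model, assuming the support functions are consulted only when the index is created. The names MetaPage, InsertionScanKey, ToyIndex, dedup_safe, create_index, make_scankey, and insert are all hypothetical and are not the actual nbtree structures or routines.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MetaPage:
    # Hypothetical field: the final "is deduplication safe?" answer.
    dedup_safe: bool


@dataclass
class InsertionScanKey:
    # Copy of the metapage flag, carried down into the insertion path.
    dedup_safe: bool


@dataclass
class ToyIndex:
    meta: MetaPage
    deduplicated: int = 0  # inserts that were allowed to use deduplication
    plain: int = 0         # inserts that were not


def create_index(support_fns: List[Callable[[], bool]]) -> ToyIndex:
    # The only place the per-column support functions are called.
    # Deduplication is safe only if every indexed column says so.
    safe = all(fn() for fn in support_fns)
    return ToyIndex(meta=MetaPage(dedup_safe=safe))


def make_scankey(index: ToyIndex) -> InsertionScanKey:
    # Reads the stored boolean; never calls a support function.
    return InsertionScanKey(dedup_safe=index.meta.dedup_safe)


def insert(index: ToyIndex, key: InsertionScanKey) -> None:
    # The insertion path consults only the flag on the scankey.
    if key.dedup_safe:
        index.deduplicated += 1
    else:
        index.plain += 1
```

In this model the cost of the support function is paid once per index, no matter how many inserts follow, which is why the extra cycles should not be noticeable.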

--
Peter Geoghegan
