Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
Date: 2019-08-25 20:29:09
Message-ID: CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Anastasia's nbtree deduplication patch [1] has an open problem that I
would like to find a solution for: it currently assumes that there is
no difference between binary equality and opclass equality. That won't
work for opclasses such as btree/numeric, because compressing equal
numeric datums could destroy display scale if equal numeric datums
were naively lumped together (actually, the deduplication patch
doesn't work like that, but it has other subtle problems due to not
having worked out fundamental definitional issues).

We don't need to be able to assume that binary equality is exactly the
same thing as opclass equality at the level of individual tuples. We
only need to be able to assume that the user cannot observe any
differences when they are shown output for two datums that are
opclass-equal for any opclass that supports deduplication (i.e. cases
like the numeric_ops case just won't work, so we shouldn't event try).
I believe that it would be okay if we treated two IndexTuples as
equivalent and therefore targets to store together in the same posting
list when they happen to have distinct binary representations due to
the original datums having different TOAST input state. In short, the
deduplication patch cannot tolerate being unable to store
opclass-equal IndexTuples in the same posting list when the opclass
(or the underlying type being indexed) somehow allows that equality
isn't equivalence -- that's simply unsupportable. The opclass gets one
chance to say whether or not it vetoes the use of deduplication: at
CREATE INDEX time.

Consumers of this new infrastructure probably won't be limited to the
deduplication feature; the same infrastructure will be needed for a
B-Tree prefix compression patch (I'm thinking of a configurable CREATE
INDEX prefix compression feature). GIN never had to solve this problem
because its indexes are always lossy, and cannot support index-only
scans. It seems likely that a scheme like the one I have in mind can
work for the vast majority of Postgres B-Tree indexes in practice, so
I don't think that the user-visible restrictions I'm considering will
make the patch significantly less useful (it's already very useful).
The most notable restriction for users will almost certainly be not
supporting deduplication within indexes that use nondeterministic
collations. They were already paying a performance penalty during
hashing, though.

I would like to:

* Get some buy-in on whether or not the precise distinctions I would
like to make are correct for deduplication in particular, and as
useful as possible for other cases that we may need to add later on.

* Figure out the exact interface through which opclass/opfamily
authors can represent that their notion of equality is compatible with
deduplication/compression. (I think that the use of nondeterministic
collations should disable deduplication without explicit action from
the operator class -- that should just be baked in.)

* Mark most existing btree operator classes as being compatible with
deduplication as part of making the patch committable. As I said, I
believe that their semantics are already compatible with what we need
for deduplication to work sensibly, aside from a handful of specific
exceptions.

In any case, I'm certain that problems like the btree/numeric display
scale problem are simply not worth solving directly. That would add a
huge amount of complexity for very little benefit.

[1] https://commitfest.postgresql.org/24/2202/
--
Peter Geoghegan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-08-25 20:30:38 Re: Statement timeout in pg_rewind
Previous Message Tom Lane 2019-08-25 20:23:34 Re: mingw32 floating point diff