Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-07-11 17:42:55
Message-ID: CAH2-Wzmo95d87N5eXn28OORCg4kJ4tqO7qSg1JxYyvF__gjyRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> Could you please elaborate more on preserving the logical contents? I
> can understand it as following: "B-Tree should have the same structure
> and invariants as if each TID in posting list be a separate tuple".

That's exactly what I mean.

> So, if we imagine each TID to become separate tuple it would be the
> same B-tree, which just can magically sometimes store more tuples in
> page. Is my understanding correct?

Yes.

> But outside code will still
> notice changes as soon as it directly accesses B-tree pages (like
> contrib/amcheck does). Do you mean we need an API for accessing
> logical B-tree tuples or something?

Well, contrib/amcheck isn't really outside code. But amcheck's
"rootdescend" option will still need to be able to supply a heap TID
as just another column, and get back zero or one logical tuples from
the index. This is important because retail index tuple deletion needs
to be able to think about logical tuples in the same way. I also think
that it might be useful for the planner to expect to get back
duplicates in heap TID order in the future (or in reverse order in the
case of a backwards scan). Query execution and VACUUM code outside of
nbtree should be able to pretend that there is no such thing as a
posting list.

The main thing that the patch is missing that is needed to "preserve
logical contents" is the ability to update/expand an *existing*
posting list due to a retail insertion of a new duplicate that happens
to be within the range of that existing posting list. This will
usually be a non-HOT update that doesn't change the value for the row
in the index -- that must change the posting list, even when there is
available space on the page without recompressing. We must still
occasionally be eager, like GIN always is, though in practice we'll
almost always add to posting lists in a lazy fashion, when it looks
like we might have to split the page -- the lazy approach seems to
perform best.

> I think in order to deduplicate "equal but distinct" values we need at
> least to give up with index only scans. Because we have no
> restriction that equal according to B-tree opclass values are same for
> other operations and/or user output.

We can either prevent index-only scans in the case of affected
indexes, or prevent compression, or give the user a choice. I'm not
too worried about how that will work for users just yet.

> Do I understand correctly that current patch may produce posting lists
> of the same value with overlapping ranges of TIDs? If so, it's
> definitely wrong.

Yes, it can, since the assertion fails. It looks like the assertion
itself was changed to match what I expect, so I assume that this bug
will be fixed in the next version of the patch. It fails with a fairly
big index on text for me.

> > * Maybe we could do compression with unique indexes when inserting
> > values with NULLs? Note that we now treat an insertion of a tuple with
> > NULLs into a unique index as if it wasn't even a unique index -- see
> > the "checkingunique" optimization at the beginning of _bt_doinsert().
> > Having many NULL values in a unique index is probably fairly common.
>
> I think unique indexes may benefit from deduplication not only because
> of NULL values. Non-HOT updates produce duplicates of non-NULL values
> in unique indexes. And those duplicates can take significant space.

I agree that we should definitely have an open mind about unique
indexes, even with non-NULL values. If we can prevent a page split by
deduplicating the contents of a unique index page, then we'll probably
win. Why not try? This will need to be tested.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Belyavsky 2019-07-11 17:49:41 Re: Ltree syntax improvement
Previous Message Robert Haas 2019-07-11 17:27:11 Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)