Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2020-01-08 22:56:22
Message-ID: CAH2-WzmZGO-NpSCu33tsoZcOskS8UbTUdd7e9V2kJA-kDusWaA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 8, 2020 at 5:56 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 04/01/2020 03:47, Peter Geoghegan wrote:
> > Attached is v28, which fixes bitrot from my recent commits to refactor
> > VACUUM-related code in nbtpage.c.
>
> I started to read through this gigantic patch.

Oh come on, it's not that big. :-)

> I got about 1/3 way
> through. I wrote minor comments directly in the attached patch file,
> search for "HEIKKI:". I wrote them as I read the patch from beginning to
> end, so it's possible that some of my questions are answered later in
> the patch. I didn't have the stamina to read through the whole patch
> yet, I'll continue later.

Thanks for the review! Anything that you've written that I do not
respond to directly can be assumed to have been accepted by me.

I'll start with responses to the points that you raise in your patch
that need a response.

Patch comments
==============

* Furthermore, deduplication can be turned on or off as needed, or
applied HEIKKI: When would it be needed?

I believe that hardly anybody will want to turn off deduplication in
practice. My point here is that we're flexible -- we're not
maintaining posting lists like GIN. We're just deduplicating as and
when needed. We can change our preference about that any time. Turning
off deduplication won't magically undo past deduplications, of course,
but everything mostly works in the same way when deduplication is on
or off. We're just talking about an alternative physical
representation of the same logical contents.

* HEIKKI: How do LP_DEAD work on posting list tuples?

Same as before, except that the bit applies to all TIDs in the tuple
together (I'll mention this in the commit message, though). Note that
the fact that we delay deduplication also means that we delay merging
the LP_DEAD bits. And we always prefer to remove LP_DEAD items.
Finally, we refuse to do a posting list split when the tuple's LP_DEAD
bit is set, so it's now possible to delete tuples with their LP_DEAD
bits set a little earlier, before the point where a page split has to
be avoided -- see the new code and comments at the end of
_bt_findinsertloc().

See also: my later response to your e-mail remarks on LP_DEAD bits,
unique indexes, and space accounting.

* HEIKKI: When is it [deduplication] not safe?

With opclasses like btree/numeric_ops, where display scale messes
things up. See this thread for more information on the infrastructure
that we need for that:

https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ(at)mail(dot)gmail(dot)com

* HEIKKI: Why is it safe to read on version 3 indexes? Because unused
space is set to zeros?

Yes. Same applies to version 4 indexes that come from Postgres 12 --
users must REINDEX to call _bt_opclasses_support_dedup() and set the
metapage field, but we can rely on the new field being all zeroes
before that happens. (It would be possible to teach pg_upgrade to set
the field for compatible indexes from Postgres 12, but I don't want to
bother with that. We probably cannot safely call
_bt_opclasses_support_dedup() with a buffer lock held, so that seems
like the only way.)
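
To make the zero-initialization point concrete, here's a minimal
standalone sketch of the read-path logic. The struct and field names
are stand-ins of mine, not the patch's actual metapage layout:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct BTMetaPageSketch
{
    uint32_t btm_version;    /* 3 or 4 */
    uint32_t btm_dedup_safe; /* hypothetical new field; reads as 0 on
                              * any metapage written before the field
                              * existed */
} BTMetaPageSketch;

static bool
dedup_is_safe(const BTMetaPageSketch *metad)
{
    /*
     * Unused metapage space is zeroed, so on a version 3 index, or on a
     * version 4 index from Postgres 12 that was never REINDEXed, the
     * new field reads as 0, which we treat as "not safe".
     */
    return metad->btm_version >= 4 && metad->btm_dedup_safe != 0;
}

int
main(void)
{
    BTMetaPageSketch v3 = {3, 0};   /* pre-upgrade index */
    BTMetaPageSketch v4 = {4, 1};   /* REINDEXed, dedup-capable */

    printf("v3: %d, reindexed v4: %d\n",
           dedup_is_safe(&v3), dedup_is_safe(&v4));
    return 0;
}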

* HEIKKI: Do we need it as a separate flag, isn't it always safe with
version 4 indexes, and never with version 3?

No, it isn't *always* safe with version 4 indexes, for reasons that
have nothing to do with the on-disk representation (like the display
scale issue, nondeterministic collations, etc). It really is a
distinct condition. (Deduplication is never safe with version 3
indexes, obviously.)

It occurs to me now that we probably don't even want to make the
metapage field about deduplication (though that's what it says right
now). Rather, it should be about supporting a general category of
optimizations that include deduplication, and might also include
prefix compression in the future. Note that whether or not we should
actually apply these optimizations is always a separate question.

* + * Non-pivot tuples complement pivot tuples, which only have key
columns. HEIKKI: What does it mean that they complement pivot
tuples?

It means that every tuple is either a pivot tuple or a non-pivot
tuple; the two categories are exhaustive.

* + * safely (index storage parameter separately indicates if
deduplication is HEIKKI: Is there really an "index storage
parameter" for that? What is that, something in the WITH clause?

Yes, there is actually an index storage parameter named
"deduplication" (something in the WITH clause). This is deliberately
not named "btree_deduplication", the current name of the GUC. This
exists to make the optimization controllable at the index level.
(Though I should probably mention the GUC first in this code comment,
or not even mention the less significant storage parameter.)

* HEIKKI: How much memory does this [BTScanPosData.items array of
width MaxBTreeIndexTuplesPerPage] need now? Should we consider
pallocing this separately?

But BTScanPosData isn't allocated on the stack anyway.

* HEIKKI: Would it be more clear to have a separate struct for the
posting list split case? (i.e. don't reuse xl_btree_insert)

I doubt it, but I'm open to it. We don't do it that way in a number of
existing cases.

* HEIKKI: Do we only generate one posting list in one WAL record? I
would assume it's better to deduplicate everything on the page, since
we're modifying it anyway.

You might be right about that. Let me get back to you on that.

* HEIKKI: Does this [xl_btree_vacuum WAL record] store a whole copy of
the remaining posting list on an updated tuple? Wouldn't it be simpler
and more space-efficient to store just the deleted TIDs?

It would certainly be more space efficient in cases where we delete
some but not all TIDs -- hard to know how much that matters. Don't
think that it would be simpler, though.

I have an open mind about this. I can try it the other way if you like.
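
To make the trade-off concrete, here's a rough standalone sketch
contrasting the two payload shapes. Neither struct is the patch's
actual WAL record layout; the types and numbers are stand-ins of mine:

#include <stdint.h>
#include <stdio.h>

typedef struct TidSketch { uint16_t hi, lo, off; } TidSketch;   /* 6 bytes */

/* Option A: log the whole remaining posting list (what the patch does) */
typedef struct VacuumLogRemaining
{
    uint16_t  ntids;
    TidSketch tids[];       /* all surviving TIDs, replayed verbatim */
} VacuumLogRemaining;

/* Option B: log only the deleted TIDs (the suggestion above) */
typedef struct VacuumLogDeleted
{
    uint16_t  ntids;
    TidSketch tids[];       /* redo must recompute the survivors */
} VacuumLogDeleted;

int
main(void)
{
    /* e.g. a 100-TID posting list where VACUUM kills 10 TIDs */
    unsigned survivors = 90, dead = 10;

    printf("option A payload: ~%zu bytes\n",
           sizeof(uint16_t) + survivors * sizeof(TidSketch));
    printf("option B payload: ~%zu bytes\n",
           sizeof(uint16_t) + dead * sizeof(TidSketch));
    return 0;
}

Either way redo arrives at the same page image; the question is which
side does the work of recomputing the surviving TIDs.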

* HEIKKI: Do we ever do that? Do we ever set the LP_DEAD bit on a
posting list tuple?

As I said, we are able to set LP_DEAD bits on posting list tuples, if
and only if all the TIDs are dead (so even if all but one TID is dead,
the bit cannot be set). This limitation does not seem to matter in
practice,
in part because LP_DEAD bits can be set before we deduplicate --
that's another benefit of delaying deduplication until the point where
we'd usually have to split the page.
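
Here's a minimal standalone sketch of that rule, using stand-in types
of mine rather than the patch's code: the tuple's single LP_DEAD bit
only becomes settable once every TID in the posting list is known
dead:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for "is this heap TID known dead?" (e.g. via _bt_killitems) */
static bool tid_is_dead(unsigned tid) { return tid != 3; }

static bool
can_set_lp_dead(const unsigned *tids, size_t ntids)
{
    for (size_t i = 0; i < ntids; i++)
    {
        if (!tid_is_dead(tids[i]))
            return false;   /* a single live TID keeps the tuple alive */
    }
    return true;
}

int
main(void)
{
    unsigned posting[] = {1, 2, 3, 4};  /* TID 3 is still live */

    /* All-but-one dead: the single LP_DEAD bit still can't be set */
    printf("LP_DEAD settable? %d\n", (int) can_set_lp_dead(posting, 4));
    return 0;
}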

See also: my later response to your e-mail remarks on LP_DEAD bits,
unique indexes, and space accounting.

* HEIKKI: Well, it's optimized for that today, but if it [a posting
list] was compressed, a btree would be useful in more situations...

I agree, but I think that we should do compression by inventing a new
type of leaf page that only stores TIDs, and use that when we do a
single value mode split in nbtsplitloc.c. So we don't even use tuples
at that point (except the high key), and we compress the entire page.
That way, we don't have to worry about posting list splits and stuff
like that, which seems like the best of both worlds. Maybe we can use
a true bitmap on these special leaf pages.
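
To sketch the bitmap idea (purely speculative -- nothing like this is
in the patch): for the TIDs that point to a single heap block, one bit
per line pointer could replace a 6 byte TID per entry:

#include <stdint.h>
#include <stdio.h>

#define MAX_OFFSETS 291     /* illustrative; roughly MaxHeapTuplesPerPage
                             * for 8kB pages */

typedef struct BlockBitmapSketch
{
    uint32_t block;
    uint8_t  bits[(MAX_OFFSETS + 7) / 8];   /* bit i => offset i+1 present */
} BlockBitmapSketch;

static void
bitmap_set(BlockBitmapSketch *b, uint16_t offset)
{
    b->bits[(offset - 1) / 8] |= (uint8_t) (1 << ((offset - 1) % 8));
}

int
main(void)
{
    BlockBitmapSketch b = {42, {0}};

    bitmap_set(&b, 1);
    bitmap_set(&b, 7);

    /* 6 bytes per TID vs ~37 bytes for an entire block's bitmap */
    printf("bitmap bytes per block: %zu\n", sizeof(b.bits));
    return 0;
}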

... Now to answer the feedback from your actual e-mail ...

E-mail
======

> One major design question here is about the LP_DEAD tuples. There's
> quite a lot of logic and heuristics and explanations related to unique
> indexes.

The unique index stuff hasn't really been discussed on the thread
until now. Those parts are all my work.

> To make them behave differently from non-unique indexes, to
> keep the LP_DEAD optimization effective. What if we had a separate
> LP_DEAD flag for every item in a posting list, instead? I think we
> wouldn't need to treat unique indexes differently from non-unique
> indexes, then.

I don't think that that's quite true -- it's not so much about LP_DEAD
bits as it is about our *goals* with unique indexes. We have no reason
to deduplicate other than to delay an immediate page split, so it
isn't really about space efficiency. Having individual LP_DEAD bits
for each TID wouldn't change the picture for _bt_dedup_one_page() -- I
would still want a checkingunique flag there. But individual LP_DEAD
bits would make a lot of other things much more complicated. Unique
indexes are kind of special, in general.

The thing that I prioritized keeping simple in the patch is page space
accounting, particularly the nbtsplitloc.c logic, which doesn't need
any changes to continue to work (it's also important for page space
accounting to be simple within _bt_dedup_one_page()). I did teach
nbtsplitloc.c to take posting lists from the firstright tuple into
account, but only because they're often unusually large, making it a
worthwhile optimization. Exactly the same thing could take place with
non-key INCLUDE columns, but typically the extra payload is not very
large, so I haven't bothered with that before now.

If you had a "supplemental header" to store per-TID LP_DEAD bits, that
would make things complicated for page space accounting. Even if it
was only one byte, you'd have to worry about it taking up an entire
extra MAXALIGN() quantum within _bt_dedup_one_page(). And then there
is the complexity within _bt_killitems(), needed to make the
kill_prior_tuple stuff work. I might actually try to do it that way if
I thought that it would perform better, or be simpler than what I came
up with. I doubt that, though.

In summary: while it would be possible to have per-TID LP_DEAD bits, I
don't think it would be even remotely worth it. I can go into my
doubts about the performance benefits if you want.

Note also: I tend to think of the LP_DEAD bit setting within
_bt_check_unique() as almost a separate optimization to the
kill_prior_tuple stuff, even though they both involve LP_DEAD bits.
The former is much more important than the latter. The
kill_prior_tuple thing was severely regressed in Postgres 9.5 without
anyone really noticing [1].

> Another important decision here is the on-disk format of these tuples.
> The format of IndexTuples on a b-tree page has become really
> complicated. The v12 changes to store TIDs in order did a lot of that,
> but this makes it even more complicated.

It adds two new functions: BTreeTupleIsPivot(), and
BTreeTupleIsPosting(). This means that there are three basic kinds of
B-Tree tuple layout. We can detect which kind any given tuple is in a
low context way. The three possible cases (see the sketch after this
list) are:

* Pivot tuples.

* Posting list tuples (non-pivot tuples that have at least two head TIDs).

* Regular/small non-pivot tuples -- this representation has never
changed in all the time I've worked on Postgres.
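
Here's a standalone sketch of how the three cases can be told apart.
The INDEX_ALT_TID_MASK value matches the real 13th t_info bit, but the
BT_IS_POSTING status bit and the mini structs are illustrative
stand-ins, not the patch's definitions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define INDEX_ALT_TID_MASK  0x2000  /* 13th IndexTupleData header bit */
#define BT_IS_POSTING       0x2000  /* same value by coincidence; this one
                                     * lives in the t_tid offset field */

typedef struct TupleSketch
{
    uint16_t t_info;        /* may include INDEX_ALT_TID_MASK */
    uint16_t t_tid_offset;  /* offset field of the header TID */
} TupleSketch;

static bool
tuple_is_pivot(const TupleSketch *t)
{
    return (t->t_info & INDEX_ALT_TID_MASK) != 0 &&
           (t->t_tid_offset & BT_IS_POSTING) == 0;
}

static bool
tuple_is_posting(const TupleSketch *t)
{
    return (t->t_info & INDEX_ALT_TID_MASK) != 0 &&
           (t->t_tid_offset & BT_IS_POSTING) != 0;
}

int
main(void)
{
    TupleSketch pivot = {INDEX_ALT_TID_MASK, 0};
    TupleSketch posting = {INDEX_ALT_TID_MASK, BT_IS_POSTING};
    TupleSketch plain = {0, 1};     /* regular/small non-pivot tuple */

    printf("pivot: %d/%d posting: %d/%d plain: %d/%d\n",
           tuple_is_pivot(&pivot), tuple_is_posting(&pivot),
           tuple_is_pivot(&posting), tuple_is_posting(&posting),
           tuple_is_pivot(&plain), tuple_is_posting(&plain));
    return 0;
}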

You'll notice that there are lots of new assertions, including in
places that don't have anything to do with the new code --
BTreeTupleIsPivot() and BTreeTupleIsPosting() assertions.

I think that there is only really one wart here that tends to come up
outside the nbtree.h code itself again and again: the fact that
!heapkeyspace indexes may give false negatives when
BTreeTupleIsPivot() is used. So any BTreeTupleIsPivot() assertion has
to include some nearby heapkeyspace field to cover that case (or else
make sure that the index is a v4+/heapkeyspace index in some other
way). Note, however, that we can safely assert !BTreeTupleIsPivot() --
that won't produce spurious assertion failures with heapkeyspace
indexes. Note also that the new BTreeTupleIsPosting() function works
reliably on all B-Tree versions.

The only future requirements that I can anticipate for the tuple format are:

1. The need to support wider TIDs. (I am strongly of the opinion that
this shouldn't work all that differently to what we have now.)

2. The need for a page-level prefix compression feature. This can work
by requiring decompression code to assume that the common prefix for
the page just isn't present.

This seems doable within the confines of the current/proposed B-Tree
tuple format. Though we still need to have a serious discussion about
the future of TIDs in light of stuff like ZedStore. I think that fully
logical table identifiers are worth supporting, but they had better
behave pretty much like a TID within index access method code -- they
better show temporal and spatial locality in about the same way TIDs
do. They should be compared as generic integers, and accept reasonable
limits on TID width. It should be possible to do cheap binary searches
on posting lists in about the same way.
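
To show what I mean by comparing them as generic integers, here's a
standalone sketch (stand-in types and names of mine) in which a
block + offset pair packs into a uint64, so that posting list binary
search becomes a plain integer search:

#include <stdint.h>
#include <stdio.h>

typedef struct TidSketch { uint32_t block; uint16_t offset; } TidSketch;

static uint64_t
tid_as_uint64(TidSketch tid)
{
    return ((uint64_t) tid.block << 16) | tid.offset;
}

/* Return slot of key in a sorted posting list, or -1 if absent */
static int
posting_search(const TidSketch *tids, int ntids, TidSketch key)
{
    uint64_t k = tid_as_uint64(key);
    int      lo = 0, hi = ntids - 1;

    while (lo <= hi)
    {
        int      mid = lo + (hi - lo) / 2;
        uint64_t v = tid_as_uint64(tids[mid]);

        if (v == k)
            return mid;
        else if (v < k)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}

int
main(void)
{
    TidSketch list[] = {{1, 3}, {1, 7}, {2, 1}, {5, 2}};
    TidSketch probe = {2, 1};

    printf("found at slot %d\n", posting_search(list, 4, probe));
    return 0;
}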

> I know there are strong
> backwards-compatibility reasons for the current format, but
> nevertheless, if we were to design this from scratch, what would the
> B-tree page and tuple format be like?

That's a good question, but my answer depends on the scope of the question.

If you define "from scratch" to mean "5 years ago", then I believe
that it would be exactly the same as what we have now. I specifically
anticipated the need to have posting list TIDs (existing v12 era
comments in nbtree.h and amcheck things about posting lists). And what
I came up with is almost the same as the GIN format, except that we
have explicit pivot tuples (to make suffix truncation work), and use
the 13th IndexTupleData header bit (INDEX_ALT_TID_MASK) in a way that
makes it possible to store non-pivot tuples in a space-efficient way
when they are all unique. A plain GIN tuple uses an extra MAXALIGN()
quantum to store an entry tree tuple that only has one TID.

If, on the other hand, you're talking about a totally green field
situation, then I would probably not use IndexTuple at all. I think
that a representation that stores offsets right in the tuple header
(so no separate varlena headers) has more advantages than
disadvantages. It would make it easier to do both suffix truncation
and prefix compression. It also makes it cheap to skip to the end of
the tuple. In general, it would be nice if the IndexTupleData TID was
less special, but that assumption is baked into a lot of code -- most
of which is technically not in nbtree.

We expect very specific things about the alignment of TIDs -- they are
assumed to be 3 SHORTALIGN()'d uint16 fields. Within nbtree,
btree_xlog_split() assumes that SHORTALIGN()'d access to the t_info
field by IndexTupleSize() will be okay. I bet that there are a number of
subtle assumptions about our use of IndexTupleData + ItemPointerData
that we have no idea about right now. So changing it won't be that
easy.
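
As a concrete illustration of the "3 SHORTALIGN()'d uint16 fields"
assumption, here's a standalone sketch with stand-in types (not the
real ItemPointerData declaration) plus compile-time checks:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct BlockIdSketch { uint16_t bi_hi; uint16_t bi_lo; } BlockIdSketch;

typedef struct ItemPointerSketch
{
    BlockIdSketch ip_blkid;
    uint16_t      ip_posid;
} ItemPointerSketch;

int
main(void)
{
    /* The "3 SHORTALIGN()'d uint16 fields" assumption, checked */
    static_assert(sizeof(ItemPointerSketch) == 3 * sizeof(uint16_t),
                  "TID must pack into exactly 6 bytes");
    static_assert(_Alignof(ItemPointerSketch) == _Alignof(uint16_t),
                  "TID must only need 2-byte alignment");

    printf("sizeof=%zu align=%zu\n",
           sizeof(ItemPointerSketch), _Alignof(ItemPointerSketch));
    return 0;
}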

As for page level stuff, I believe that we mostly do things the right
way already. I would prefer it if the line pointer array was at the
end of the page, so that tuples could go at the start of the page and
be appended front to back (maybe the special area would still be at
the end). That's a very general issue, though -- Andres says that that
would help branch prediction, though I'm not sure of the details
offhand.

Questions
=========

Finally, I have some specific questions for you about the patch:

1. How do you feel about the design of posting list splits, and my
explanation of that design in the nbtree README?

2. How do you feel about the idea of stopping VACUUM from clearing the
BTP_HAS_GARBAGE page level flag?

I suspect that it's much better to have it falsely set than to have it
falsely unset. The extra cost is that we do a useless extra call to
_bt_vacuum_one_page(), but that's very cheap in the context of having
to deal with a page that's full, that we might have to split (or
deduplicate) anyway. But the extra benefit could perhaps be quite
large. This question doesn't really have that much to do with
deduplication.

[1] https://www.postgresql.org/message-id/flat/CAH2-Wz%3DSfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA%40mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d
--
Peter Geoghegan
