Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
Date: 2021-04-23 10:45:45
Message-ID: CAEze2WiCBhxa8kWbO3sHS0T9TsLZAwsjBiuUJApv7md73+zWhw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 17 Apr 2021 at 01:05, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Fri, Apr 16, 2021 at 2:20 PM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > > Interesting approach. I think that in an ideal world we would have a
> > > tuple format with attribute lengths/offsets right in the header.
> >
> > I believe that that would indeed be ideal w.r.t. access speed, but
> > also quite expensive w.r.t. amount of data stored. This would add 2
> > bytes per attribute in the current infrastructure (11 bits at least
> > for each attribute to store offsets), on the current 12 bytes of
> > overhead per indextuple (= 8 for IndexTuple + 4 for ItemId). That is
> > probably always going to be a non-starter, seeing that we can
> > relatively easily optimize our current attribute access patterns.
>
> I don't think that that's why it's a non-starter. This design assumes
> a world in which everything has already been optimized for this
> layout. You no longer get to store the varlena header inline, which
> would break a lot of things in Postgres were it ever to be attempted.
> The space efficiency issues don't really apply because you have an
> offset for fixed-length types -- their presence is always implied. I
> think that you need to encode NULLs differently, which is a lot less
> space efficient when there are a lot of NULLs. But on the whole this
> design seems more efficient than what we have currently.

I believe that that depends on your definition of 'efficiency'. For
storage efficiency, the current design is quite good (except for the
varlena header size of 4 bytes for attributes > 127 bytes, which could
be 2 bytes because pages can not be larger than 64kiB (actually 32kiB)
with our current design, all attributes use just about the least data
possible). For access efficiency / code complexity, you're probably
right that storing attribute offsets in the tuple header is
preferable, but such design would still need some alignment calls, or
store the length of attributes as well to prevent reading the
alignment padding of the next attribute into the variable length
attribute at the additional overhead of up to 2 bytes per attribute.

> This representation of index tuples would be a totally reasonable
> design were we in a green field situation. (Which is pretty far from
> the situation we're actually in, of course.)

That might indeed be the case, assuming a green field with different
or no processing architecture or storage limitations. CPU to storage
bandwidth can be (and often is) a bottleneck, as well as alignment.

> > I understand and appreciate that the "-INF" truncation that is
> > currently in place is being relied upon in quite some places. Part of
> > the effort for "+INF" truncation would be determining where and how to
> > keep the benefits of the "-INF" truncation. I also believe that for
> > internal pages truncating to "+INF" would be perfectly fine; the
> > optimizations that I know of only rely on it at the leaf level.
>
> I don't doubt that there is nothing special about -inf from a key
> space point of view. Actually...you could say that -inf is special to
> the limited extent that we know it only appears in pivot tuples and
> exploit that property when the !pivotsearch case/optimization is used.
> But that isn't much of an exception at a high level, so whatever.
>
> Anyway, it follows that +inf could in principle be used instead in
> some or all cases -- all that is truly essential for correctness is
> that the invariants always be respected. We're still in agreement up
> until here.

Agreed

> > Yes, I also read and appreciate your comments on +inf vs -inf when
> > this came up in [0].
>
> I'm impressed that you've done your homework on this.
>
> > However, if we could choose, I think that having
> > both options could be quite beneficial, especially when dealing with
> > many duplicates or duplicate prefixes.
>
> This is where things are much less clear -- maybe we're not in
> agreement here. Who knows, though -- maybe you're right. But you
> haven't presented any kind of argument. I understand that it's hard to
> articulate what effects might be in play with stuff like this, so I
> won't force the issue now. Strong evidence is of course the only way
> that you'll reliably convince me of this.
>
> I should point out that I am a little confused about how this +inf
> business could be both independently useful and pivotal to
> implementing [dynamic] prefix truncation/compression. Seems...weird to
> discuss them together, except maybe to mention in passing that this
> +inf thing is notable for particularly helping dynamic prefix stuff --
> which is it?

I agree that my reasoning might have been unclear and confusing.

I mean that most benefits that we could receive from +inf would be in
improving the ability to apply [dynamic] prefix truncation on a page
by limiting the keyspace of that page to 'local' values. If prefix
truncation is impossible / does not apply for some index (a single
unique column !allequalimage index is a likely worst case scenario),
then applying +inf would potentially be detrimental to the performance
of certain other optimizations (e.g. the continuescan optimization),
in which case using -inf would probably be preferable. Ergo, I'm
planning on making _bt_recsplitloc aware of +inf and -inf after
implementing physical prefix truncation, and allow it to decide if and
when each should be applied, if it turns out it consistently improves
space and/or time performance without significantly decreasing either.

> It is my strong preference that nbtsplitloc.c continue to know
> approximately nothing about compression or deduplication. While it is
> true that nbtsplitloc.c's _bt_recsplitloc() is aware of posting lists,
> this is strictly an optimization that is only justified by the fact
> that posting lists are sometimes very large, and therefore worth
> considering directly -- just to get a more accurate idea of how a
> relevant split point choice affects the balance of free space (we
> don't bother to do the same thing with non-key INCLUDE columns because
> they're generally small and equi-sized). And so this _bt_recsplitloc()
> thing no exception to the general rule, which is:
> deduplication/posting list maintenance should be *totally* orthogonal
> to the page split choice logic (the design of posting list splits
> helps a lot with that). We can afford to have complicated split point
> choice logic because the question of which split point is optimal is
> totally decoupled from the question of which are correct -- in
> particular, from the correctness of the space accounting used to
> generate candidate split points.

I would argue that it also knows about duplicate attributes and shared
prefixes? It already optimizes (unintentionally?) for deduplication by
choosing split points between two runs of equal values. I believe that
implementing the same for prefixes (if not already in place) would not
stand out too much. I think we can discuss that more extensively when
we actually have code that would benefit from that.

> It may interest you to know that I once thought that it would be nice
> to have the *option* of +inf too, so that we could use it in very rare
> cases like the pathological SPLIT_MANY_DUPLICATES case that
> _bt_bestsplitloc() has some defenses against. It would perhaps be nice
> if we could use +inf selectively in that case. I never said anything
> about this publicly before now, mostly because it wasn't that
> important -- pathological behaviors like this have never been reported
> on by users a full 18 months after the release of 12.0, so it's
> unlikely to be a real concern.

I do not per se disagree, but I should note that the amazing work on
btree page split prevention through 'heapkeyspace', deduplication and
eager tuple deletion have changed some key behaviours of btree index
pages. The same would likely occur once physical prefix truncation is
implemented, and in that case I believe that some decisions that were
previously non-problematic might need to be re-examined.

> > > If you're going to pursue full prefix compression anyway, maybe you
> > > should use a low key on the leaf level in cases where the optimization
> > > is in use. This creates complexity during page deletion, because the
> > > low key in the subtree to the right of the deletion target subtree may
> > > need to be updated. Perhaps you can find a way to make that work that
> > > isn't too complicated.
> >
> > That would be an interesting research path as well, the cost/benefit
> > analysis would be much trickier when comparing to the status quo.
>
> I'd say that's unclear right now.

I agree. My 'trickier' pointed to that "adding an extra non-key tuple
to the page" needs solid understanding and reasoning about the use of
the AM to prove that it's worth the extra metadata on the page.
Proving that is, in my opinion, difficult.

> > > You should probably account for index size here. I have lots of my own
> > > tests for space utilization, using data from a variety of sources.
> >
> > I'd like to mention that the current (and measured) patchset only does
> > _logical_ dynamic prefix truncation, not the physical prefix
> > truncation that is described on the wiki page.
>
> If you change how _bt_truncate() behaves in any way (e.g. sometimes
> it's lastleft/+inf based now), and nothing else, you're still bound to
> change the space utilization with the tests that I maintain -- though
> perhaps only at the level of noise. I sometimes call these tests "wind
> tunnel tests". It turns out that you can simulate rather a lot about a
> real complicated workload with simple, deterministic, serial test
> cases -- provided you're only interested in the space utilization.
> This helped a lot for both the Postgres 12 and Postgres 13 stuff
> (though not the Postgres 14 stuff).

I would be interested in running these benchmarks when I get to
updating the physical format. Good to know there are relatively easy
tests available.

> > These patches are
> > useful on their own for multi-key-column btree performance (and some
> > GIST), regardless of later patches implementing physical dynamic
> > prefix truncation in the btree AM.
>
> Have you isolated the performance impact of the first patch at all?
> Can you characterize how well it works on its own, perhaps just
> informally?

The REINDEX performance results is the place where attribute iteration
shines best, as the hot path in reindex is the tuple comparison which
used index_getattr a lot in its hot path, and the dynamic prefix
truncation is not applicable there (yet?). Its time spent went down by
over 6% for the indexes with 3 key columns of variable length, whereas
the indexes with only a single fixed-size attribute took only slightly
longer (+0.6% avg in 3 runs on a laptop, high variance). I have not
tested it with GIST, but I believe that similar results are realistic
there as well for varlen attributes.

> It would be convenient if the first patch could be treated
> as an independent thing.

Patch 0002 was the reason for writing 0001, and uses the performance
improvements of 0001 to show it's worth. As such, I submitted them as
a set. If you'd like, I could submit 0002 seperately?

With regards,

Matthias van de Meent

[+] instead of starting _binsrch with only the high key compare
result, we could also eagerly compare the search key to the lowest
key. This way, we have high+low bounds for the whole page, instead of
having that only after finding a key < searchkey on the page. The
effort might just as well not be worth it, as it is one extra key
compare (out of max 9 on a page, plus one highkey).

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-04-23 10:56:32 Re: TRUNCATE on foreign table
Previous Message Fujii Masao 2021-04-23 10:40:33 Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?