Adversarial case for "many duplicates" nbtree split strategy in v12

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Adversarial case for "many duplicates" nbtree split strategy in v12
Date: 2019-07-02 22:51:34
Message-ID: CAH2-WznCNvhZpxa__GqAa1fgQ9uYdVc=_apArkW2nc-K3O7_NA@mail.gmail.com
Lists: pgsql-hackers

The single largest benefit of the v12 nbtree enhancement was its
adaptability with indexes where a portion of the key space contains
many duplicates. Successive page splits choose split points in a way
that leaves duplicates together on their own page, and eventually pack
pages full of duplicates.
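
For illustration, here is a quick way to see that on a v12 install
(this sketch isn't part of the repro below -- the table and index names
are made up, and it assumes pgstattuple is available). It builds an
index whose key space is mostly unique but contains one heavily
duplicated value, then checks leaf density:

drop table if exists dupdemo;
create table dupdemo(val int4);
create index dupdemo_idx on dupdemo (val);
-- Mostly-unique key space, plus a large run of duplicates in the middle.
-- Successive splits should keep the duplicates together and pack their
-- pages tightly:
insert into dupdemo select i from generate_series(1, 50000) i;
insert into dupdemo select 25000 from generate_series(1, 50000);
create extension if not exists pgstattuple;
select avg_leaf_density from pgstatindex('dupdemo_idx');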

I have thought up a specific case where the logic can be fooled into
consistently doing the wrong thing, leading to very poor space
utilization:

drop table if exists descnums;
create table descnums(nums int4);
create index bloat_desc on descnums (nums desc nulls first);
-- Fill first leaf page (leaf root page) with NULL values, to the
point where it almost splits:
insert into descnums select null from generate_series(0,400);
-- Insert integers, which will be treated as descending insertions within index:
insert into descnums select i from generate_series(0,10000) i;
-- Observe if we had 50:50 page splits here:
create extension if not exists pgstattuple;
select avg_leaf_density, leaf_fragmentation from pgstatindex('bloat_desc');

The final output looks like this:

 avg_leaf_density | leaf_fragmentation
------------------+--------------------
             1.83 |              99.88
(1 row)

Although the case is contrived, it is clearly not okay that this is
possible -- avg_leaf_density should be about 50 here, which is what
you'll see on v11. You'll also see an avg_leaf_density that's at least
50 if you vary any of the details. For example, if you remove "nulls
first" then you'll get an avg_leaf_density of ~50. Or, if you make the
index ASC then the avg_leaf_density is almost exactly 90 for the usual
reason (the NULLs won't "get in the way" of consistently getting
rightmost splits that way). Note that I've deliberately arranged for
the page splits to be as ineffective as possible by almost filling a
leaf page with NULLs, leaving a tiny gap for all future non-NULL
integer insertions.
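
To make the ASC comparison concrete, here is that variation spelled out
(just a sketch -- new table/index names so it can run alongside the
repro above, and note that the index must exist before the inserts so
that it's the incremental split behavior being measured):

drop table if exists ascnums;
create table ascnums(nums int4);
-- Plain ASC index (default NULLS LAST), same data and insertion order:
create index bloat_asc on ascnums (nums);
insert into ascnums select null from generate_series(0,400);
insert into ascnums select i from generate_series(0,10000) i;
select avg_leaf_density, leaf_fragmentation from pgstatindex('bloat_asc');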

This is another case where a bimodal distribution causes trouble when
combined with auto-incrementing insertions -- it is slightly similar
to the TPC-C issue that the v12 work fixed IMV. You could say that the
real root of the problem here is one of two things, depending on your
perspective:

1. Arguably, nbtsplitloc.c is already doing the right thing here, and
the root of the issue is that _bt_truncate() lacks any way of
generating a new high key that is "mid way between" the value NULL in
the lastleft tuple and the integer in the firstright tuple during the
first split. If _bt_truncate() created a "midpoint value" of around
INT_MAX/2 for the new high key during the first split, then everything
would work out -- we wouldn't keep splitting the same leftmost page
again and again. The behavior would stabilize in the same way as it
does in the ASC + NULLS LAST case, without hurting any other case that
already works well. This isn't an academic point; we might actually
need to do that in order to be able to pack the leaf page 90% full
with DESC insertions, which ought to be a future goal for
nbtsplitloc.c. But that's clearly not in scope for v12.

2. The other way you could look at it (which is likely to be the basis
of my fix for v12) is that nbtsplitloc.c has been fooled into treating
page splits as "many duplicate" splits, when in fact there are not
that many duplicates involved -- there just appear to be many
duplicates because they're so unevenly distributed on the page. It
would be fine for it to be wrong if there were some way that successive
page splits could course-correct (see point 1), but that isn't
possible here because of the skew -- we get stuck with the same lousy
choice of split point again and again. (There also wouldn't be a
problem if the integer values were random, since we'd have just one
or two uneven splits at the start.)
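
As a sanity check on that last point, the same scenario can be rerun
with random integer values in place of the descending sequence (again
just a sketch, with new object names):

drop table if exists randnums;
create table randnums(nums int4);
create index bloat_rand on randnums (nums desc nulls first);
insert into randnums select null from generate_series(0,400);
-- Random values rather than a descending sequence; the first split or
-- two may be uneven, but later splits shouldn't keep hitting the same
-- leftmost page:
insert into randnums select (random() * 10000)::int4 from generate_series(0,10000);
select avg_leaf_density, leaf_fragmentation from pgstatindex('bloat_rand');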

I've already written a rough patch that fixes the issue by taking this
second view of the problem. The patch makes nbtsplitloc.c more
skeptical about finishing with the "many duplicates" strategy,
avoiding the problem -- it can just fall back on a 50:50 page split
when it looks like this is happening (the related "single value"
strategy must already do something similar in _bt_strategy()).
Currently, it simply checks whether the new item's offset number falls
immediately to the right of the split point indicated by the "many
duplicates" strategy -- within ~10 offset positions -- since that
strongly suggests that there aren't that many duplicates after all.
I may make the check more careful
still, for example by performing additional comparisons on the page to
make sure that there are in fact very few distinct values on the whole
page.

My draft fix doesn't cause any regressions in any of my test cases --
the fix barely affects the splits chosen for my real-world test data,
and TPC test data. As far as I know, I already have a comprehensive
fix. I will need to think about it much more carefully before
proceeding, though.

Thoughts?
--
Peter Geoghegan
