Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-11 21:04:59
Message-ID: CAH2-Wz=VyPOXA_9n4goZ7U-jYjO9icnu4GEc0_9Y_8g7S_zx9g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> I reviewed them and everything looks good except the idea of not
> splitting dead posting tuples.
> According to the comments on scan->ignore_killed_tuples in genam.c:107,
> it may lead to incorrect tuple order on a replica.
> I'm not sure if it leads to any real problem, though, or whether it
> will be resolved by subsequent visibility checks.

Fair enough, but I didn't do that because it's compelling on its own
-- it isn't. I did it because it seemed like the best way to handle
posting list splits in a version of the patch where LP_DEAD bits can
be set on posting list tuples. I think that we have 3 high level
options here:

1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting
lists at all. This is clearly the easiest approach.

2. We do what I did in v11 of the patch -- we make it so that
_bt_insertonpg() and _bt_split() never have to deal with LP_DEAD
posting lists that they must split in passing.

3. We add additional code to _bt_insertonpg() and _bt_split() to deal
with the rare case where they must split an LP_DEAD posting list,
probably by unsetting the bit or something like that. Obviously it
would be wrong to leave the LP_DEAD bit set for the newly inserted
heap tuple's TID that must go in a posting list that had its LP_DEAD
bit set -- that would make it dead to index scans even after its xact
successfully committed.

I think that you already agree that we want to have the
kill_prior_tuple optimizations with posting lists, so #1 isn't really
an option. That just leaves #2 and #3. Since posting list splits are
already assumed to be quite rare, it seemed far simpler to take the
conservative approach of forcing clean-up that removes LP_DEAD bits so
that _bt_insertonpg() and _bt_split() don't have to think about it.
Obviously I think it's important that we make as few changes as
possible to _bt_insertonpg() and _bt_split(), in general.
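
To make the contrast concrete, here is a standalone sketch of the kind
of bookkeeping option 3 would force on the insert path. The types and
function here are invented stand-ins for illustration -- this is not
the patch's actual code:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Invented stand-ins: a "posting list" is a sorted array of heap TIDs
 * that share one key value, plus the LP_DEAD hint from its line
 * pointer.
 */
typedef uint64_t HeapTid;

typedef struct PostingList
{
    bool    lp_dead;        /* LP_DEAD hint bit from the line pointer */
    int     ntids;          /* number of heap TIDs */
    HeapTid tids[32];       /* TIDs in ascending order */
} PostingList;

/*
 * Option 3 sketch: split "orig" in passing so that the incoming live
 * TID lands in the middle of its TID range.  Assumes orig->ntids < 32,
 * so the right half has room for one extra TID.
 */
void
posting_list_split(const PostingList *orig, HeapTid newtid,
                   PostingList *lefthalf, PostingList *righthalf)
{
    int     split = 0;

    /* Count the existing TIDs that sort before the incoming TID */
    while (split < orig->ntids && orig->tids[split] < newtid)
        split++;

    /* Left half keeps the TIDs that sort before the new TID */
    lefthalf->ntids = split;
    memcpy(lefthalf->tids, orig->tids, sizeof(HeapTid) * split);

    /* Right half is the new TID followed by the remaining old TIDs */
    righthalf->tids[0] = newtid;
    memcpy(righthalf->tids + 1, orig->tids + split,
           sizeof(HeapTid) * (orig->ntids - split));
    righthalf->ntids = orig->ntids - split + 1;

    /*
     * The extra rule that option 3 imposes: the LP_DEAD bit cannot be
     * carried over, because the half that absorbed the live TID no
     * longer consists entirely of dead TIDs.  Conservatively clear the
     * bit on both halves.
     */
    lefthalf->lp_dead = false;
    righthalf->lp_dead = false;
}

Option 2 makes all of this unnecessary, because the insert path is
guaranteed to never see an LP_DEAD posting list in the first place.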

I don't understand what you mean about visibility checks. There is
nothing truly special about the way in which _bt_findinsertloc() will
sometimes have to kill LP_DEAD items so that _bt_insertonpg() and
_bt_split() don't have to think about LP_DEAD posting lists. As far as
recovery is concerned, it is just another XLOG_BTREE_DELETE record,
like any other. Note that there is a second call to
_bt_binsrch_insert() within _bt_findinsertloc() when it has to
generate a new XLOG_BTREE_DELETE record (by calling
_bt_dedup_one_page(), which calls _bt_delitems_delete() in a way that
isn't dependent on the BTP_HAS_GARBAGE status bit being set).

> Anyway, it's worth adding more comments in
> _bt_killitems() explaining why it's safe.

There is no question that the little snippet of code I added to
_bt_killitems() in v11 is still too complicated. We also have to
consider cases where the array overflows because the scan direction
was changed (see the kill_prior_tuple comment block in btgettuple()).
Yeah, it's messy.

> Attached is v12, which contains WAL optimizations for posting split
> and page deduplication.

Cool.

> * xl_btree_split record doesn't contain the posting tuple anymore;
> instead it keeps 'in_posting offset' and repeats the logic of
> _bt_insertonpg() as you proposed upthread.

That looks good.

> * I introduced a new xlog record, XLOG_BTREE_DEDUP_PAGE, which
> contains info about groups of tuples deduplicated into posting
> tuples. In principle, it is possible to fit it into some existing
> record, but I preferred to keep things clear.

I definitely think that inventing a new WAL record was the right thing to do.

> I haven't measured how these changes affect WAL size yet.
> Do you have any suggestions on how to automate testing of new WAL records?
> Is there any suitable place in regression tests?

I don't know about the regression tests (I doubt that there is a
natural place for such a test), but I came up with a rough test case.
I more or less copied the approach that you took with the index build
WAL reduction patches, though I also figured out a way of subtracting
heapam WAL overhead to get a real figure. I attach the test case --
note that you'll need to use the "land" database with this. (This test
case might need to be improved, but it's a good start.)

> * I also noticed that _bt_dedup_one_page() can be optimized to return
> early when no tuples were deduplicated. I wonder if we can introduce
> internal statistics to tune deduplication? That is returning to the
> idea of BT_COMPRESS_THRESHOLD, which can help to avoid extra work for
> pages that have very few duplicates or pages that are already full of
> posting lists.

I think that the BT_COMPRESS_THRESHOLD idea is closely related to
making _bt_dedup_one_page() behave incrementally.
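
As a strawman, the early-out check that a threshold implies might look
like this standalone sketch (again, invented stand-in types and an
illustrative threshold value -- not the patch's code):

#include <stdbool.h>

#define BT_COMPRESS_THRESHOLD   2   /* illustrative value only */

/* Invented stand-in for an index tuple's key */
typedef struct PageItem
{
    int     key;
} PageItem;

/*
 * Return true only if the page has at least one run of
 * BT_COMPRESS_THRESHOLD or more equal keys -- that is, only if a dedup
 * pass could plausibly free space.  Pages with very few duplicates, or
 * pages already dominated by posting lists (every "run" is a single
 * tuple), are left untouched, avoiding both the page rewrite and the
 * WAL record.
 */
bool
dedup_is_worthwhile(const PageItem *items, int nitems)
{
    int     runlen = 1;

    for (int i = 1; i < nitems; i++)
    {
        runlen = (items[i].key == items[i - 1].key) ? runlen + 1 : 1;
        if (runlen >= BT_COMPRESS_THRESHOLD)
            return true;
    }
    return false;
}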

On my machine, v12 of the patch actually uses slightly more WAL than
v11 did with the nbtree_wal_test.sql test case -- it's 6510 MB of
nbtree WAL in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL
compression, so if I turned that off v12 would probably win by a small
amount). Both numbers are wildly excessive, though. The master branch
figure is only 2011 MB, which is only about 1.8x the size of the index
on the master branch. And this is for a test case that makes the index
6.5x smaller, so the gap between total index size and total WAL volume
is huge here -- the volume of WAL is nearly 40x greater than the index
size!

You are right to wonder what the result would be if we put
BT_COMPRESS_THRESHOLD back in. It would probably significantly reduce
the volume of WAL, because _bt_dedup_one_page() would no longer
"thrash". However, I strongly suspect that that wouldn't be good
enough at reducing the WAL volume down to something acceptable. That
will require an approach to WAL-logging that is much more logical than
physical. The nbtree_wal_test.sql test case involves a case where page
splits mostly don't WAL-log things that were previously WAL-logged by
simple inserts, because nbtsplitloc.c has us split in a right-heavy
fashion when there are lots of duplicates. In other words, the
_bt_split() optimization to WAL volume naturally works very well with
the test case, or really any case with lots of duplicates, so the
"write amplification" to the total volume of WAL is relatively small
on the master branch.

I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally. Otherwise you end up
needlessly rewriting things that didn't actually change way too often.
That's definitely not okay. Why worry about bringing 40x down to 20x,
or even 10x? It needs to be comparable to the master branch.
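
To sketch what "once per posting list" could mean in practice, here is
a toy standalone program, with an invented stand-in for the real
WAL-logging call:

#include <stdio.h>

/*
 * Invented stand-in for the WAL machinery: emit one small record
 * describing a single group of duplicates merged into a posting list.
 */
static void
log_dedup_group(int first_offset, int ntuples)
{
    printf("WAL: dedup group at offset %d, %d tuples merged\n",
           first_offset, ntuples);
}

/*
 * Walk the (sorted) keys on a page and emit one record per group that
 * is actually merged.  Groups of one are untouched tuples, so nothing
 * is logged -- or rewritten -- for them.
 */
static void
dedup_page(const int *keys, int nitems)
{
    int     start = 0;

    for (int i = 1; i <= nitems; i++)
    {
        if (i == nitems || keys[i] != keys[start])
        {
            if (i - start > 1)
                log_dedup_group(start, i - start);
            start = i;
        }
    }
}

int
main(void)
{
    int     keys[] = {1, 1, 1, 2, 3, 3, 4};

    dedup_page(keys, 7);    /* logs two groups: 3 x "1" and 2 x "3" */
    return 0;
}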

> To be honest, I don't believe that incremental deduplication can
> really improve anything, because no matter how many items were
> compressed, we still rewrite all items from the original page to the
> new one, so why not do our best. What do we save with this
> incremental approach?

The point of being incremental is not to save work in cases where a
page split is inevitable anyway. Rather, the idea is that we can be
even more lazy, and avoid doing work that will never be needed --
maybe delaying page splits actually means preventing them entirely.
Or, we can spread out the work over time, so that the amount of WAL
per checkpoint is smoother than what we would get with a batch
approach. My mental model of page splits is that there are sometimes
many of them on the same page again and again in a very short time
period, but more often the chance of any individual page being split
is low. Even the rightmost page of a serial PK index isn't truly an
exception, because a new rightmost page isn't "the same page" as the
original rightmost page -- it is its new right sibling.

Since we're going to have to optimize the WAL logging anyway, it will
be relatively easy to experiment with incremental deduplication within
_bt_dedup_one_page(). The WAL logging is the hard part, so let's
focus on that rather than worrying too much about whether or not
incrementally doing all the work (not just the WAL logging) makes
sense. It's still too early to be sure about whether or not that's a
good idea.

--
Peter Geoghegan

Attachment Content-Type Size
nbtree_wal_test.sql application/octet-stream 4.4 KB
