Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-07-31 16:22:59
Message-ID: b3e5697f-8787-fcfa-0a17-04c12486cfa9@postgrespro.ru
Lists: pgsql-hackers

24.07.2019 4:22, Peter Geoghegan wrote:
>
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3. In general, my goal for the revision was to make sure
> that all of my old tests from the v12 work passed, and to make sure
> that amcheck can detect almost any possible problem. I tested the
> amcheck changes by corrupting random state in a test index using
> pg_hexedit, then making sure that amcheck actually complained in each
> case.
>
> I also fixed one or two bugs in passing, including the bug that caused
> an assertion failure in _bt_truncate(). That was down to a subtle
> off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
> make that many changes to your v2. There are probably some things
> about the patch that I still don't understand, or things that I have
> misunderstood.
>
Thank you for this review and fixes.
> * Changed the custom binary search code within _bt_compare_posting()
> to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
> of any reason not to do it that way?
It's ok to update it. There was no particular reason, just my habit.
> * Added quite a few "FIXME"/"XXX" comments at various points, to
> indicate where I have general concerns that need more discussion.

+        * FIXME:  The calls to BTreeGetNthTupleOfPosting() allocate memory,

If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting();
we can use BTreeTupleGetPostingN() instead and iterate over TIDs rather
than whole tuples.

Fixed in version 4.
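
For reference, this is roughly what I mean by iterating over TIDs directly
(a sketch only: BTreeTupleIsPosting(), BTreeTupleGetNPosting() and
BTreeTupleGetPostingN() are the accessors from the WIP patch, the helper
itself is made up for illustration, and the usual access/nbtree.h includes
are assumed):

static bool
posting_tuple_contains_tid(IndexTuple itup, ItemPointer target)
{
    if (BTreeTupleIsPosting(itup))
    {
        int         nposting = BTreeTupleGetNPosting(itup);

        for (int i = 0; i < nposting; i++)
        {
            /* returns a pointer into the tuple itself, no palloc() needed */
            ItemPointer tid = BTreeTupleGetPostingN(itup, i);

            if (ItemPointerEquals(tid, target))
                return true;
        }

        return false;
    }

    /* plain leaf tuple: t_tid is the heap TID */
    return ItemPointerEquals(&itup->t_tid, target);
}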

> * Included my own pageinspect hack to visualize the minimum TIDs in
> posting lists. It's broken out into a separate patch file. The code is
> very rough, but it might help someone else, so I thought I'd include
> it.
Cool, I think we should add it to the final patchset, probably as a
separate function, by analogy with tuple_data_split().

> I also have some new concerns about the code in the patch that I will
> point out now (though only as something to think about a solution on
> -- I am unsure myself):
>
> * It's a bad sign that compression involves calls to PageAddItem()
> that are allowed to fail (we just give up on compression when that
> happens). For one thing, all existing calls to PageAddItem() in
> Postgres are never expected to fail -- if they do fail we get a "can't
> happen" error that suggests corruption. It was a good idea to take
> this approach to get the patch to work, and to prove the general idea,
> but we now need to fully work out all the details about the use of
> space. This includes complicated new questions around how alignment is
> supposed to work.

The main reason to implement this gentle error handling is that
deduplication can cause storage overhead, which may lead to running out of
space on the page.

First of all, it is a legacy of the previous versions, where
BTreeFormPostingTuple() was not able to form a non-posting tuple even when
the number of posting items was 1.

Another case I had in mind is the situation where we have two tuples:
t_tid | t_info | key + t_tid | t_info | key

and compressed result is:
t_tid | t_info | key | t_tid | t_tid

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
tuple can take as much space as, or more than, the two original tuples did
before alignment. That can only happen if the key size is at most 4 bytes.
In that situation each original tuple must have been MAXALIGNed to 16 bytes
(32 bytes in total), while the resulting posting tuple is at most 24 bytes
(6+2+4+6+6). So this case is also safe.
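
To spell out the arithmetic (a back-of-the-envelope check only, assuming
8-byte MAXALIGN, a 6-byte t_tid and a 2-byte t_info):

#include <assert.h>
#include <stddef.h>

/* assumes MAXIMUM_ALIGNOF == 8, as on most 64-bit platforms */
#define MAXALIGN(LEN)   (((size_t) (LEN) + 7) & ~((size_t) 7))

int
main(void)
{
    /* two separate tuples with a 4-byte key, each MAXALIGNed on the page */
    size_t      orig = 2 * MAXALIGN(6 + 2 + 4);      /* 2 * 16 = 32 bytes */

    /* one posting tuple: header TID + t_info + key + two posting-list TIDs */
    size_t      post = MAXALIGN(6 + 2 + 4 + 6 + 6);  /* 24 bytes */

    assert(post <= orig);   /* compression never takes more space here */
    return 0;
}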

I changed the DEBUG message to an ERROR in v4, and it passes all regression
tests. I doubt that the existing tests cover all corner cases, so I'll try
to add more targeted tests.

> Alignment in nbtree is already complicated today -- you're supposed to
> MAXALIGN() everything in nbtree, so that the MAXALIGN() within
> bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
> length (note that heapam can have tuples whose lp_len isn't aligned,
> so nbtree could do it differently if it proved useful). Code within
> nbtsplitloc.c fully understands the space requirements for the
> bufpage.c routines, and is very careful about it. (The bufpage.c
> details are supposed to be totally hidden from code like
> nbtsplitloc.c, but I guess that that ideal isn't quite possible in
> reality. Code comments don't really explain the situation today.)
>
> I'm not sure what it would look like for this patch to be as precise
> about free space as nbtsplitloc.c already is, even though that seems
> desirable (I just know that it would mean you would trust
> PageAddItem() to work in all cases). The patch is different to what we
> already have today in that it tries to add *less than* a single
> MAXALIGN() quantum at a time in some places (when a posting list needs
> to grow by one item). The devil is in the details.
>
> * As you know, the current approach to WAL logging is very
> inefficient. It's okay for now, but we'll need a fine-grained approach
> for the patch to be commitable. I think that this is subtly related to
> the last item (i.e. the one about alignment). I have done basic
> performance tests using unlogged tables. The patch seems to either
> make big INSERT queries run as fast or faster than before when
> inserting into unlogged tables, which is a very good start.
>
> * Since we can now split a posting list in two, we may also have to
> reconsider BTMaxItemSize, or some similar mechanism that worries about
> extreme cases where it becomes impossible to split because even two
> pages are not enough to fit everything. Think of what happens when
> there is a tuple with a single large datum, that gets split in two
> (the tuple is split, not the page), with each half receiving its own
> copy of the datum. I haven't proven to myself that this is broken, but
> that may just be because I haven't spent any time on it. OTOH, maybe
> you already have it right, in which case it seems like it should be
> explained somewhere. Possibly in nbtree.h. This is tricky stuff.
Hmm, I don't see the problem.
In the current implementation each posting tuple is smaller than
BTMaxItemSize, so no split can produce a tuple of a larger size.
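
To illustrate the invariant I rely on (a hypothetical sketch, not the
literal patch code, and the helper name is made up): during compression we
only keep appending TIDs while the would-be posting tuple still fits under
BTMaxItemSize(), so no posting tuple, and therefore no posting split, can
produce an oversized item.

/* assumes access/nbtree.h; BTMaxItemSize() and MAXALIGN() are existing macros */
static bool
posting_tuple_would_fit(Page page, Size basesize, int ntids)
{
    Size    newsize = MAXALIGN(basesize + ntids * sizeof(ItemPointerData));

    return newsize <= BTMaxItemSize(page);
}
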
> * I agree with all of your existing TODO items -- most of them seem
> very important to me.
>
> * Do we really need to keep BTreeTupleGetHeapTID(), now that we have
> BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
> that callers don't need to think about the pivot vs posting list thing
> themselves? See the new code added to _bt_mkscankey() by v3, for
> example. It now handles both cases/macros at once, in order to keep
> its amcheck caller happy. amcheck's verify_nbtree.c received similar
> ugly code in v3.
No, we don't need them both. I don't mind combining them into one macro.
Actually, we never needed BTreeTupleGetMinTID(),
since its functionality is covered by BTreeTupleGetHeapTID().
On the other hand, in some cases BTreeTupleGetMinTID() looks more readable.
For example, here:

    Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
                              BTreeTupleGetMinTID(righttup)) < 0);
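
If we do combine them, something along these lines could serve as the
single accessor (a sketch only; BTreeTupleIsPosting() and
BTreeTupleGetPostingN() are the WIP-patch accessors, the name is purely
illustrative, and the fallback relies on the existing
BTreeTupleGetHeapTID() behaviour for pivot and plain tuples):

static inline ItemPointer
BTreeTupleGetMinHeapTID(IndexTuple itup)
{
    /* posting list TIDs are kept sorted, so the first one is the minimum */
    if (BTreeTupleIsPosting(itup))
        return BTreeTupleGetPostingN(itup, 0);

    /* pivot or plain leaf tuple: the existing accessor handles these */
    return BTreeTupleGetHeapTID(itup);
}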

> * We should at least experiment with applying compression when
> inserting into unique indexes. Like Alexander, I think that
> compression in unique indexes might work well, given how they must
> work in Postgres.

The main reason why I decided to avoid applying compression to unique
indexes is the performance of microvacuum. Microvacuum is not applied to
items inside a posting tuple, and I expect it to matter for unique indexes,
which ideally contain only a few live values.

One more thing I want to discuss:
/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page().  If for some reason we
 * find a dead item, don't compress it, to allow the upcoming
 * microvacuum or vacuum to clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

In the previous review Rafia asked about that "some reason".
To figure out whether this situation is possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version, and it failed in a
performance test. Unfortunately, I was not able to reproduce the failure.
The explanation I see is that the page had DEAD items, but for some reason
BTP_HAS_GARBAGE was not set, so _bt_vacuum_one_page() was not called.
I find it difficult to understand what could lead to this situation, so we
probably need to inspect it more closely to rule out the possibility of a
bug.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
v4-0001-Compression-deduplication-in-nbtree.patch text/x-patch 82.0 KB
