Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2020-01-14 00:07:40
Message-ID: CAH2-WzmQA0NtcBSX-Msci5kJ3Eff+yA2QOATsMuQ5LzOiZs7hQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> * HEIKKI: Do we only generate one posting list in one WAL record? I
> would assume it's better to deduplicate everything on the page, since
> we're modifying it anyway.

Still thinking about this one.

> * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the
> remaining posting list on an updated tuple? Wouldn't it be simpler and
> more space-efficient to store just the deleted TIDs?

This probably makes sense. The btreevacuumposting() code that
generates "updated" index tuples (tuples that VACUUM uses to replace
existing ones when some but not all of the TIDs need to be removed)
was derived from GIN's ginVacuumItemPointers(). That approach works
well enough, but we can do better now. It shouldn't be that hard.

My preferred approach is a little different to your suggested approach
of storing the deleted TIDs directly. I would like to make it work by
storing an array of uint16 offsets into a posting list, one array per
"updated" tuple (with one offset per deleted TID within each array).
These arrays (which must include an array size indicator at the start)
can appear in the xl_btree_vacuum record, at the same place the patch
currently puts a raw IndexTuple. They'd be equivalent to a raw
IndexTuple -- the REDO routine would reconstruct the same raw posting
list tuple on its own. This approach seems simpler, and is clearly
very space efficient.

This approach is similar to the approach used by REDO routines to
handle posting list splits. Posting list splits must call
_bt_swap_posting() on the primary, while the corresponding REDO
routines also call _bt_swap_posting(). For space efficient "updates",
we'd have to invent a sibling utility function -- we could call it
_bt_delete_posting(), and put it next to _bt_swap_posting() within
nbtdedup.c.

How do you feel about that approach? (And how do you feel about the
existing "REDO routines call _bt_swap_posting()" business that it's
based on?)

> * HEIKKI: Would it be more clear to have a separate struct for the
> posting list split case? (i.e. don't reuse xl_btree_insert)

I've concluded that this one probably isn't worthwhile. We'd have to
carry a totally separate record on the stack within _bt_insertonpg().
If you feel strongly about it, I will reconsider.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-01-14 00:19:38 Re: DROP OWNED CASCADE vs Temp tables
Previous Message Peter Geoghegan 2020-01-13 23:49:40 Re: Amcheck: do rightlink verification with lock coupling