Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Юрий Соколов <funny(dot)falcon(at)gmail(dot)com>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2020-01-29 01:29:05
Message-ID: CAH2-Wzme6NA70bkHK=Jrjx+bkHr9DAj9gX+y5OpmvEVNy0eqVQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 14, 2020 at 6:08 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Still no progress on these items, but I am now posting v30. A new
> version seems warranted, because I now want to revive a patch from a
> couple of years back as part of the deduplication project -- it would
> be good to get feedback on that sooner rather than later.

Actually, I decided that this wasn't necessary -- I won't be touching
compactify_tuples() at all (at least not right now). Deduplication
doesn't even need to use PageIndexMultiDelete() in the attached
revision of the patch, v31, so speeding up compactify_tuples() is no
longer relevant.

v31 simplifies everything quite a bit. This is something that I came
up with more or less as a result of following Heikki's feedback. I
found that reviving the v17 approach of using a temp page buffer in
_bt_dedup_one_page() (much like _bt_split() always has) was a good
idea. This approach was initially revived in order to make dedup WAL
logging work on a whole-page basis -- Heikki suggested we do it that
way, and so now we do. But this approach is also a lot faster in
general, and has additional benefits besides that.

When we abandoned the temp buffer approach back in September of last
year, the unique index stuff was totally immature and unsettled, and
it looked like a very incremental approach might make sense for unique
indexes. It doesn't seem like a good idea now, though. In fact, I no
longer even believe that a custom checkingunique/unique index strategy
in _bt_dedup_one_page() is useful. That is also removed in v31, which
will also make Heikki happy -- he expressed a separate concern about
the extra complexity there.

I've done a lot of optimization work since September, making these
simplifications possible. The problems that justified the extra
complexity seem to have gone away. I'm pretty sure that the
recent _bt_check_unique() posting list tuple _bt_compare()
optimization is the biggest part of that. The checkingunique/unique
index strategy in _bt_dedup_one_page() always felt overfit to my
microbenchmarks, so I'm glad to be rid of it.

Note that v31 changes nothing about how we think about deduplication
in unique indexes in general, nor how it is presented to users. There
are still special criteria for how deduplication is *triggered* in
unique indexes. We continue to trigger a deduplication pass based on
seeing a duplicate within _bt_check_unique() + _bt_findinsertloc() --
otherwise we never attempt deduplication in a unique index (same as
before). Plus the GUC still doesn't affect unique indexes, unique
index deduplication still isn't really documented in the user docs (it
just gets a passing mention in the B-Tree internals section), etc.

In my opinion, the patch is now pretty close to being committable. I
do have two outstanding open items for the patch, though. These items
are:

* We still need infrastructure that marks B-Tree opclasses as safe for
deduplication, to avoid things like the numeric display scale problem,
collations that are unsafe for deduplication because they're
nondeterministic, etc.

I talked to Anastasia about this over private e-mail recently. This is
going well; I'm expecting a revision later this week. It will be based
on all feedback to date over on the other thread [1] that we have for
this part of the project.

* Make VACUUM's WAL record more space efficient when it contains one
or more "updates" to an existing posting list tuple.

Currently, when VACUUM must delete some but not all TIDs from a
posting list, we generate a new posting list tuple and dump it into
the WAL stream -- the REDO routine simply overwrites the existing item
with a version lacking the TIDs that have to go. This could be space
inefficient with certain workloads, such as workloads where only one
or two TIDs are deleted from a very large posting list tuple again and
again. Heikki suggested I do something about this. I intend to at
least research the problem, and can probably go ahead with
implementing it without any trouble.

What nbtree VACUUM does in the patch right now is roughly the same as
what GIN's VACUUM does for posting lists within posting tree pages --
see ginVacuumPostingTreeLeaf(). (GIN's entry tree works differently,
and its VACUUM is horribly inefficient about WAL logging entry tree
leaf pages -- see ginVacuumEntryPage() -- but that isn't what I'm
talking about here.) We might as well do better than
GIN/ginVacuumPostingTreeLeaf() here if we can.

The patch is pretty clever about minimizing the volume of WAL in all
other contexts, managing to avoid any other case of what could be
described as "WAL space amplification". Maybe we should do the same
with the xl_btree_vacuum record just to be *consistent* about it.

[1] https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ(at)mail(dot)gmail(dot)com
--
Peter Geoghegan

Attachment Content-Type Size
v31-0002-Teach-pageinspect-about-nbtree-posting-lists.patch application/x-patch 18.5 KB
v31-0003-DEBUG-Show-index-values-in-pageinspect.patch application/x-patch 4.4 KB
v31-0001-Add-deduplication-to-nbtree.patch application/x-patch 213.6 KB
