Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-03 01:53:09
Message-ID: CAH2-WzkwDobJU9sgq8J2gaZk_Eo6Dsyny1-gWXhY-9VDZnHOTQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 31, 2019 at 1:04 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I've found a fix for these Valgrind issues.

Attached is v10, which fixes the Valgrind issue.

Other changes:

* The code now fully embraces the idea that posting list splits
involve "changing the incoming item" in a way that "avoids" having the
new/incoming item overlap with an existing posting list tuple. This
allowed me to cut down on the changes required within nbtinsert.c
considerably.
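
Roughly, the rule looks something like this (just a minimal standalone
sketch of the TID-swap idea, with made-up struct and function names --
not the actual nbtinsert.c code):

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData: block number + offset */
typedef struct TID { uint32_t block; uint16_t offset; } TID;

static int
tid_cmp(const TID *a, const TID *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Posting list split, in miniature: if the incoming TID falls within the
 * posting list's TID range, swap it with the posting list's rightmost TID.
 * The caller then proceeds as if the (changed) incoming item had never
 * overlapped with the posting list in the first place.
 */
static void
posting_split(TID *posting, int nitems, TID *incoming)
{
    TID     orig = *incoming;
    int     i;

    /* Only needed when the new TID lands inside the posting list's range */
    if (tid_cmp(incoming, &posting[nitems - 1]) >= 0)
        return;

    /* Incoming item takes over the posting list's current maximum TID */
    *incoming = posting[nitems - 1];

    /* Insert the original incoming TID into the posting list, keeping order */
    for (i = nitems - 1; i > 0 && tid_cmp(&orig, &posting[i - 1]) < 0; i--)
        posting[i] = posting[i - 1];
    posting[i] = orig;
}

int
main(void)
{
    TID     posting[] = {{10, 1}, {10, 5}, {42, 3}};
    TID     incoming = {10, 7};     /* falls between (10,5) and (42,3) */

    posting_split(posting, 3, &incoming);
    printf("incoming is now (%u,%u)\n", incoming.block, incoming.offset);
    /* posting list now holds (10,1) (10,5) (10,7); incoming is (42,3) */
    return 0;
}

The point is that after the swap the incoming item no longer overlaps
with the posting list, so the rest of the insertion path doesn't need to
know that anything unusual happened.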

* Streamlined a lot of the code in nbtsearch.c. I was able to
significantly simplify _bt_compare() and _bt_binsrch_insert().

* Removed the DEBUG4 traces. A lot of these had to go when I
refactored nbtsearch.c code, so I thought I might as well remove the
remaining ones. I hope that you don't mind (go ahead and add them back
where that makes sense).

* A backwards scan will return "logical tuples" in descending order
now. We should do this on general principle, and also because of the
possibility of future external code that expects and takes advantage
of consistent heap TID order.

This change might even have a small performance benefit today, though:
index scans that visit multiple heap pages but only match on a single
key will pin each heap page visited only once. Visiting the heap pages
in descending order within a B-Tree page full of duplicates, but in
ascending order within individual posting lists, could result in
unnecessary extra pinning.
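
To make "descending order" concrete for posting lists, here is a trivial
standalone illustration (not the real nbtsearch.c scan code):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct TID { uint32_t block; uint16_t offset; } TID;

/*
 * Print the TIDs of one posting list in scan order: ascending for a
 * forward scan, descending for a backward scan.  On a backward scan the
 * page is walked from its last item to its first, so returning each
 * posting list back-to-front keeps heap TIDs in strict descending order.
 */
static void
report_posting(const TID *posting, int nitems, bool backward)
{
    int     i;

    if (!backward)
        for (i = 0; i < nitems; i++)
            printf("(%u,%u) ", posting[i].block, posting[i].offset);
    else
        for (i = nitems - 1; i >= 0; i--)
            printf("(%u,%u) ", posting[i].block, posting[i].offset);
    printf("\n");
}

int
main(void)
{
    TID     posting[] = {{7, 1}, {7, 4}, {9, 2}};

    report_posting(posting, 3, false);  /* (7,1) (7,4) (9,2) */
    report_posting(posting, 3, true);   /* (9,2) (7,4) (7,1) */
    return 0;
}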

* Standardized terminology. We consistently call what the patch adds
"deduplication" rather than "compression".

* Added a new section on the design to the nbtree README. This is
fairly high level, and talks about dynamics that we can't really talk
about anywhere else, such as how nbtsplitloc.c "cooperates" with
deduplication, producing an effect that is greater than the sum of its
parts.

* I also made some changes to the WAL logging for leaf page insertions
and page splits.

I didn't add the optimization that you anticipated in your nbtxlog.h
comments (i.e. only WAL-log a rewritten posting list when it will go
on the left half of the split, just like the new/incoming item thing
we have already). I agree that that's a good idea, and should be added
soon. Actually, I think the whole "new item vs. rewritten posting list
item" thing makes the WAL logging confusing, so this is not really
about performance.

Maybe the easiest way to do this is also the way that performs best.
I'm thinking of this: maybe we could completely avoid WAL-logging the
entire rewritten/split posting list. After all, the contents of the
rewritten posting list are derived from the existing/original posting
list, as well as the new/incoming item. We can make the WAL record
much smaller on average by making standbys repeat a little bit of the
work performed on the primary. Maybe we could WAL-log
"in_posting_offset" itself, and an ItemPointerData (obviously the new
item offset number tells us the offset number of the posting list that
must be replaced/memmoved()'d). Then have the standby repeat some of
the work performed on the primary -- at least the heap TID swap, since
that's very little extra work for the standby but could really reduce
the WAL volume. This might actually be simpler.
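
In sketch form, the WAL record and the standby-side work could look
something like this (standalone toy code with invented names; in
particular, this assumes that "in_posting_offset" means the position
within the existing posting list where the new item's original heap TID
belongs):

#include <stdio.h>
#include <stdint.h>

typedef struct TID { uint32_t block; uint16_t offset; } TID;

/*
 * Hypothetical WAL record payload for a posting list split: the offset
 * within the existing posting list where the incoming tuple's original
 * heap TID belongs, plus that original heap TID.  Everything else is
 * derived from what is already on the standby's copy of the page.
 */
typedef struct xl_posting_split_sketch
{
    uint16_t    in_posting_offset;  /* where the new TID belongs */
    TID         orig_new_tid;       /* the incoming tuple's original TID */
} xl_posting_split_sketch;

/*
 * Redo sketch: repeat the primary's TID swap.  The standby shifts the
 * posting list's TIDs right by one starting at in_posting_offset, writes
 * the original new TID into the hole, and uses the displaced maximum TID
 * as the heap TID of the newly inserted tuple.
 */
static TID
redo_posting_split(TID *posting, int nitems,
                   const xl_posting_split_sketch *xlrec)
{
    TID     final_new_tid = posting[nitems - 1];
    int     i;

    for (i = nitems - 1; i > xlrec->in_posting_offset; i--)
        posting[i] = posting[i - 1];
    posting[i] = xlrec->orig_new_tid;

    return final_new_tid;   /* TID to store in the new/incoming tuple */
}

int
main(void)
{
    TID     posting[] = {{10, 1}, {10, 5}, {42, 3}};
    xl_posting_split_sketch xlrec = {2, {10, 7}};
    TID     newtid = redo_posting_split(posting, 3, &xlrec);

    printf("new tuple gets TID (%u,%u)\n", newtid.block, newtid.offset);
    return 0;
}

Everything the standby needs beyond those few bytes is already present
in its copy of the page.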

The WAL logging that I didn't touch in v10 is the most important thing
to improve. I am talking about the WAL-logging that is performed as
part of deduplicating all items on a page, to avoid a page split (i.e.
the WAL-logging within _bt_dedup_one_page()). That still just does a
log_newpage_buffer() in v10, which is pretty inefficient. Much like
the posting list split WAL logging stuff, WAL logging in
_bt_dedup_one_page() can probably be made more efficient by describing
deduplication in terms of logical changes. For example, the WAL
records should consist of metadata that could be read by a human as
"merge the tuples from offset number 15 until offset number 27".
Perhaps this could also share code with the posting list split stuff.
What do you think?
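
For example, something along these lines (a purely hypothetical record
layout, just to show how little would actually need to be logged):

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical logical WAL record for deduplicating one leaf page: an
 * array of intervals, each saying "merge the existing tuples from
 * base_offset through base_offset + nitems - 1 into a single posting
 * list tuple".  Replay re-derives the posting list contents from the
 * tuples already present on the standby's copy of the page, so no tuple
 * data has to be WAL-logged at all.
 */
typedef struct xl_dedup_interval_sketch
{
    uint16_t    base_offset;    /* first offset number to merge */
    uint16_t    nitems;         /* number of consecutive tuples merged */
} xl_dedup_interval_sketch;

typedef struct xl_dedup_sketch
{
    uint16_t    nintervals;
    /* xl_dedup_interval_sketch intervals[nintervals] follows */
} xl_dedup_sketch;

int
main(void)
{
    /* "merge the tuples from offset number 15 until offset number 27" */
    xl_dedup_interval_sketch iv = {15, 13};

    printf("merge offsets %u..%u into one posting list tuple\n",
           (unsigned) iv.base_offset,
           (unsigned) (iv.base_offset + iv.nitems - 1));
    return 0;
}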

Once we make the WAL-logging within _bt_dedup_one_page() more
efficient, that also makes it fairly easy to make the deduplication
that it performs occur incrementally, maybe even very incrementally. I
can imagine the _bt_dedup_one_page() caller specifying "my new tuple
is 32 bytes, and I'd really like to not have to split the page, so
please at least do enough deduplication to make it fit". Delaying
deduplication increases the amount of time that we have to set the
LP_DEAD bit for remaining items on the page, which might be important.
Also, spreading out the volume of WAL produced by deduplication over
time might be important with certain workloads. We would still
probably do somewhat more work than strictly necessary to avoid a page
split if we were to make _bt_dedup_one_page() incremental like this,
though not by a huge amount.
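
Something like this, as a standalone caricature of the stopping rule
(the names and the space accounting are invented):

#include <stdio.h>
#include <stddef.h>

/* Simplified leaf item: a key value and a size in bytes, standing in for
 * an IndexTuple */
typedef struct LeafItem { int key; size_t size; } LeafItem;

/*
 * Incremental deduplication sketch: walk runs of equal-key items front to
 * back and account for the space that merging each run into a posting
 * list tuple would free, stopping as soon as the requested number of
 * bytes has been reached.  Only the stopping rule is shown; the actual
 * merging of tuples is omitted, and the assumption that merging a run of
 * n items saves (n - 1) * per_item_overhead bytes is made up.
 */
static size_t
dedup_until(const LeafItem *items, int nitems, size_t need_free,
            size_t per_item_overhead)
{
    size_t  freed = 0;
    int     i = 0;

    while (i < nitems && freed < need_free)
    {
        int     runlen = 1;

        while (i + runlen < nitems && items[i + runlen].key == items[i].key)
            runlen++;
        if (runlen > 1)
            freed += (size_t) (runlen - 1) * per_item_overhead;
        i += runlen;            /* move past this run, merged or not */
    }
    return freed;
}

int
main(void)
{
    LeafItem items[] = {{1, 16}, {1, 16}, {1, 16}, {2, 16}, {2, 16}, {3, 16}};

    /* "my new tuple is 32 bytes, please free at least that much" */
    printf("freed %zu bytes\n", dedup_until(items, 6, 32, 16));
    return 0;
}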

OTOH, maybe I am completely wrong about "incremental deduplication"
being a good idea. It seems worth experimenting with, though. It's not
that much more work on top of making the _bt_dedup_one_page()
WAL-logging efficient, which seems like the thing we should focus on
now.

Thoughts?
--
Peter Geoghegan

Attachment Content-Type Size
v10-0002-DEBUG-Add-pageinspect-instrumentation.patch application/octet-stream 7.8 KB
v10-0001-Add-deduplication-to-nbtree.patch application/octet-stream 103.8 KB
