Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Date: 2018-07-17 23:39:47
Message-ID: CAH2-WzmEaBCpa1UyTVeWZO-_ibjexUFEu=GgJsNbVp9oa5pXHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 17, 2018 at 3:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Well, the actual problem was O(N^2) behavior:
>
> https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us
>
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6

Oh, yeah. We still put the new tuple to the left of all existing
duplicates on the leaf that we decide to insert on, because "we'll
insert it before any existing equal keys because of the way
_bt_binsrch() works". I actually plan on mostly fixing that aspect of
_bt_binsrch() while I'm at it, which is why I didn't think to frame it
that way.

It is claimed that we need to do this _bt_binsrch() go-left-on-equal
thing for internal pages because we allow duplicates, contrary to L&Y.
I find it much easier to think of it as being necessary due to index
scans where only some columns are specified in the initial insertion
scan key that finds the initial leaf page (i.e. the _bt_first()
stuff). I'm not going to touch the _bt_binsrch() go-left-on-equal
thing, because it's actually necessary for any real world
implementation that has multiple columns in tuples. Fortunately, we
can usually avoid the go-left-on-equal thing for internal pages by
avoiding equality -- a sentinel heap scan TID value may be used for
this in many cases.

The Lehman and Yao paper is badly written. They should have talked
about the _bt_binsrch() go-left-on-equal thing, rather than leaving it
to the reader to figure it out. It's not why L&Y require unique keys,
though.

Sorry if all this detail isn't useful right now. The important point
is that I actually think that this code gets most things right
already.

> I certainly have no objection to improving matters, but let's be sure
> we don't re-introduce any two-decade-old problems.

A mountain of work remains to validate the performance of the patch.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-07-17 23:58:06 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Michael Paquier 2018-07-17 23:07:48 Re: Fix some error handling for read() and errno