Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Date: 2018-07-18 02:23:58
Message-ID: CAH2-Wz=B=k4JO8onQ3_fk2bqAOrEBdNJyukqcT5c0MkfH-vZ7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 17, 2018 at 5:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Yes, that's a good point. Also, and I think pretty importantly, this
> seems essential if we want to allow retail index tuple deletion, which
> has its own set of advantages.

Retail index tuple deletion is clearly something we should have in
some form, and is the top reason to pursue the project. I could
probably come up with dozens of other reasons if I had to, though.
Here are a few of the less obvious ones:

* This improves usage_count of heap page buffers automatically, by
recognizing correlated references when there are many duplicates. [1]

* How else is VACUUM/deletion supposed to work with indirect indexes?
I think that it's impossible without either retail deletion/retail
updates, unless an AEL is okay.

* Global indexes (indexes that simultaneously cover multiple
partitions) are just a generalization of this idea, with a partition
number attribute before the heap TID attribute.

* The sort order of duplicates actually matters quite a lot to the
executor, and to the planner, which this lets us reason about in many
cases. [2][3]

* There is nothing to stop us from exploiting this within
tidbitmap.c/bitmap index scans. We can build disjunct lists of sorted
TIDs, one per value. Merging lists like this together is typically
very cheap, even when there are relatively many distinct lists. We
could probably do that on-the-fly.

* When building a bitmap for a low cardinality value from a secondary
index, why bother even visiting the leaf pages? We can make a lossy
bitmap from a subset of the internal pages.

* Techniques like bitmap semi-join, which are important for
star-schema queries, are dependent on the merging and intersection of
fact table bitmaps being cheap (basically, you build several bitmaps
through nestloop joins with some dimension table on the outer side,
and some fact table index on the inner side -- that's relatively easy
to parallelize well). Tom's "Improve OR conditions on joined columns"
patch sorts a list of TIDs [4], and I suspect that there is some
high-level connection there.

* Heap prefetching for nbtree index scans is a lot more effective than
it might otherwise be.

I really could go on, but you get the idea. The merits of each of
these ideas are debatable, but the fact that there are so many ideas
that are at least plausible-sounding seems significant to me.

> I don't think you're going to be able to rip out the getting-tired
> code, though, because presumably we'll have to continue support
> existing btree indexes that don't include TIDs for some
> probably-not-small number of future releases, even if we decide that
> all future btree indexes (and any that are reindexed) should have
> TIDs.

That's definitely true. There is no way that it's okay to ignore this
issue with pg_upgrade.

[1] https://postgr.es/m/CAH2-WzkrKKW88Yq4-BLN5N3DrFv93P8r+ti-KfbTeR08P8ckBQ@mail.gmail.com
[2] https://postgr.es/m/20170707234119.GN17566@telsasoft.com
[3] https://postgr.es/m/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com
[4] https://postgr.es/m/22002.1487099934%40sss.pgh.pa.us
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2018-07-18 03:12:06 Re: [bug fix] Produce a crash dump before main() on Windows
Previous Message Michael Paquier 2018-07-18 02:17:06 Re: PG 10: could not generate random cancel key