Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-07-05 00:06:09
Message-ID: CAH2-WzkbpKVRQyS7VKvdRc2e-S_DHsJfozstZ6symWk-JPTTug@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I tried this on my own "UK land registry" test data [1], which was
> originally used for the v12 nbtree work. My test case has a low
> cardinality, multi-column text index. I chose this test case because
> it was convenient for me.
>
> On v12/master, the index is 1100MB. Whereas with your patch, it ends
> up being 196MB -- over 5.5x smaller!

I also see a huge and consistent space saving for TPC-H. All 9 indexes
are significantly smaller. The lineitem orderkey index is "just" 1/3
smaller, which is the smallest improvement among TPC-H indexes in my
index bloat test case. The two largest indexes after the initial bulk
load are *much* smaller: the lineitem parts supplier index is ~2.7x
smaller, while the lineitem ship date index is a massive ~4.2x
smaller. Also, the orders customer key index is ~2.8x smaller, and the
order date index is ~2.43x smaller. Note that the test involved retail
insertions, not CREATE INDEX.
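
To make that methodology concrete, here is a minimal sketch of the
kind of size comparison involved. The table and index names are
invented for illustration; the actual test used the TPC-H schema. The
important part is that the index is populated by retail insertions,
since CREATE INDEX builds the tree bottom-up and never splits pages:

-- Toy, low-cardinality index populated by retail insertions.
-- (CREATE INDEX on pre-loaded data takes a separate bulk-load
-- path, so it wouldn't exercise the split-time code at all.)
CREATE TABLE dedup_test (dup_key integer);
CREATE INDEX dedup_test_idx ON dedup_test (dup_key);

-- Many duplicates per key: 1,000 distinct values, 10 million rows.
INSERT INTO dedup_test
SELECT i % 1000 FROM generate_series(1, 10000000) AS s(i);

-- Compare this number between master and the patched build.
SELECT pg_size_pretty(pg_relation_size('dedup_test_idx'));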

I haven't seen any regression in the size of any index so far,
including when the number of internal pages is all that we measure.
Actually, there seem to be cases where there is a noticeably
reduction in internal pages than in leaf pages, probably because of
interactions with suffix truncation.
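
Internal and leaf pages can be counted separately with pgstatindex(),
from the stock pgstattuple extension. A sketch, reusing the invented
index name from above:

-- Requires the pgstattuple extension (ships in contrib).
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Comparing internal_pages vs. leaf_pages across the two builds
-- shows where the reduction actually comes from.
SELECT tree_level, internal_pages, leaf_pages, avg_leaf_density
FROM pgstatindex('dedup_test_idx');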

This result is very impressive. We'll need to revisit what the right
trade-off is for the compression scheme, which Heikki had some
thoughts on when we left off 3 years ago, but that should be a lot
easier now. I am very encouraged by the fact that this relatively
simple approach already works quite nicely. It's also great to see
that bulk insertions with lots of compression are very clearly faster
with this latest revision of your patch, unlike earlier versions from
2016 that made those cases slower (though I haven't tested indexes
that don't really use compression). I think that this is because you
now do the compression lazily, at the point where it looks like we may
need to split the page. Previous versions of the patch had to perform
compression eagerly, just like GIN, which is not really appropriate
for nbtree.
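
As a rough sanity check of the insertion speed claim (again reusing
the invented table from the first sketch, and again only a sketch
rather than the benchmark I actually ran):

\timing on

-- Retail insertions into the already-populated index; run the same
-- statement against both builds and compare the reported times.
INSERT INTO dedup_test
SELECT i % 1000 FROM generate_series(1, 1000000) AS s(i);

\timing off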

--
Peter Geoghegan
