Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-07-04 17:38:21
Message-ID: CAH2-WzkyWGpsGffErjqJ866VTEPHX0PGo3n1N-Xr7ycvooNF7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> i - number of distinct values in the index.
> So i=1 means that all rows have the same key,
> and i=10000000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1 215 88
> 1000 215 90
> 100000 215 71
> 10000000 214 214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!

I also tried it out with the "Mouse genome informatics" database [2],
which was already improved considerably by the v12 work on duplicates.
This is helped tremendously by your patch. It's not quite 5.5x across
the board, of course. There are 187 indexes (on 28 tables), and almost
all of the indexes are smaller. Actually, *most* of the indexes are
*much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today,
but clearly the patch is very effective on real world data. I think
that we tend to underestimate just how common indexes with a huge
number of duplicates are.

[1] https://https:/postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[2] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-07-04 17:46:54 Re: Optimize partial TOAST decompression
Previous Message Kumar, Pawan (Nokia - IN/Bangalore) 2019-07-04 17:34:21 RE: duplicate key entries for primary key -- need urgent help