[PROPOSAL] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PROPOSAL] Effective storage of duplicates in B-tree index.
Date: 2015-08-31 07:41:15
Message-ID: 55E4051B.7020209@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.

1. Compatibility.
It's important to save compatibility with older index versions.
I'm going to change BTREE_VERSION to 3.
And use new (posting) features for v3, saving old implementation for v2.
Any objections?

2. There are several tricks to handle non-unique keys in B-tree.
More info in btree readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/README>
(chapter - Differences to the Lehman & Yao algorithm).
In the new version they'll become useless. Am I right?

3. Microvacuum.
Killed items are marked LP_DEAD and could be deleted from separate page
at time of insertion.
Now it's fine, because each item corresponds with separate TID. But
posting list implementation requires another way. I've got two ideas:
First is to mark LP_DEAD only those tuples where all TIDs are not visible.
Second is to add LP_DEAD flag to each TID in posting list(tree). This
way requires a bit more space, but allows to do microvacuum of posting
list/tree.
Which one is better?

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Oleg Bartunov 2015-08-31 09:12:02 Re: Horizontal scalability/sharding
Previous Message Jeff Janes 2015-08-31 07:39:47 Re: Buildfarm failure from overly noisy warning message