Re: [WIP] Effective storage of duplicates in B-tree index.

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Effective storage of duplicates in B-tree index.
Date: 2016-07-04 09:30:05
Message-ID: a5dec069-c723-bc39-5bec-3e1baebde2c9@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 18/03/16 19:19, Anastasia Lubennikova wrote:
> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in
> this feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message
> couple of sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression
> for each particular index.
> 4. Recheck locking considerations. I tried to write code as less
> invasive as possible, but we need to make sure that algorithm is still
> correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

I think we should pack the TIDs more tightly, like GIN does with the
varbyte encoding. It's tempting to commit this without it for now, and
add the compression later, but I'd like to avoid having to deal with
multiple binary-format upgrades, so let's figure out the final on-disk
format that we want, right from the beginning.

It would be nice to reuse the varbyte encoding code from GIN, but we
might not want to use that exact scheme for B-tree. Firstly, an
important criteria when we designed GIN's encoding scheme was to avoid
expanding on-disk size for any data set, which meant that a TID had to
always be encoded in 6 bytes or less. We don't have that limitation with
B-tree, because in B-tree, each item is currently stored as a separate
IndexTuple, which is much larger. So we are free to choose an encoding
scheme that's better at packing some values, at the expense of using
more bytes for other values, if we want to. Some analysis on what we
want would be nice. (It's still important that removing a TID from the
list never makes the list larger, for VACUUM.)

Secondly, to be able to just always enable this feature, without a GUC
or reloption, we might need something that's faster for random access
than GIN's posting lists. Or can we just add the setting, but it would
be nice to have some more analysis on the worst-case performance before
we decide on that.

I find the macros in nbtree.h in the patch quite confusing. They're
similar to what we did in GIN, but again we might want to choose
differently here. So some discussion on the desired IndexTuple layout is
in order. (One clear bug is that using the high bit of BlockNumber for
the BT_POSTING flag will fail for a table larger than 2^31 blocks.)

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Krzysztof Kaczkowski 2016-07-04 09:33:07 Cluster on NAS and data center.
Previous Message Masahiko Sawada 2016-07-04 09:01:15 Re: Reviewing freeze map code