Re: GIN improvements part 1: additional information

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: obartunov(at)gmail(dot)com
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part 1: additional information
Date: 2013-12-20 15:59:06
Message-ID: 52B4694A.3060201@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/19/2013 10:44 AM, Heikki Linnakangas wrote:
> On 12/19/2013 08:37 AM, Oleg Bartunov wrote:
>> Guys,
>>
>> before digging deep into the art of comp/decomp world I'd like to know
>> if you familiar with results of
>> http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
>> some newer research ?
>
> Yeah, I saw that paper.
>
>> Do we agree in what we really want ? Basically,
>> there are three main features: size, compression, decompression speed
>> - we should take two :)
>
> According to that Zhang et al paper you linked, the Vbyte method
> actually performs the worst on all of those measures. The other
> algorithms are quite similar in terms of size (PForDelta being the most
> efficient), while PForDelta is significantly faster to compress/decompress.
>
> Just by looking at those numbers, PForDelta looks like a clear winner.
> However, it operates on much bigger batches than the other algorithms; I
> haven't looked at it in detail, but Zhang et al used 128 integer
> batches, and they say that 32 integers is the minimum batch size. If we
> want to use it for the inline posting lists stored in entry tuples, that
> would be quite wasteful if there are only a few item pointers on the tuple.
>
> Also, in the tests I've run, the compression/decompression speed is not
> a significant factor in total performance, with either varbyte encoding
> or Simple9-like encoding I hacked together.

One disadvantage of Simple9 and other encodings that operate in batches
is that removing a value from the middle can increase the number of
bytes required for the remaining values. That's a problem during vacuum;
it's possible that after vacuuming away one item pointer, the remaining
items no longer fit on the page.

- Heikki

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Oskari Saarenmaa 2013-12-20 17:01:37 [PATCH] Make various variables read-only (const)
Previous Message Adrian Klaver 2013-12-20 15:54:25 Re: pg_upgrade & tablespaces