Re: [WIP] Effective storage of duplicates in B-tree index.

From: Thom Brown <thom(at)linux(dot)com>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Effective storage of duplicates in B-tree index.
Date: 2016-01-28 20:32:38
Message-ID: CAA-aLv4k4eq=OjZeEBrO-vNyoNeUAgF+oYW=f+R0YLZpNKJjKQ@mail.gmail.com
Lists: pgsql-hackers

On 28 January 2016 at 17:03, Thom Brown <thom(at)linux(dot)com> wrote:

>
> On 28 January 2016 at 16:12, Anastasia Lubennikova <
> a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>
>>
>> 28.01.2016 18:12, Thom Brown:
>>
>> On 28 January 2016 at 14:06, Anastasia Lubennikova <
>> a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>>
>>>
>>> 31.08.2015 10:41, Anastasia Lubennikova:
>>>
>>> Hi, hackers!
>>> I'm going to begin work on effective storage of duplicate keys in B-tree
>>> index.
>>> The main idea is to implement posting lists and posting trees for B-tree
>>> index pages as it's already done for GIN.
>>>
>>> In a nutshell, effective storage of duplicates in GIN is organised as
>>> follows.
>>> The index stores a single index tuple for each unique key. That index tuple
>>> points to a posting list, which contains pointers to heap tuples (TIDs). If
>>> too many rows have the same key, multiple pages are allocated for the TIDs,
>>> and these constitute a so-called posting tree.
>>> You can find wonderful detailed descriptions in the GIN readme
>>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
>>> and in articles <http://www.cybertec.at/gin-just-an-index-type/>.
>>> This also makes it possible to apply a compression algorithm to the posting
>>> list/tree and significantly decrease the index size. Read more in this
>>> presentation (part 1)
>>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>>
>>> Currently, a new B-tree index tuple must be inserted for each table row that
>>> we index, and that insertion can cause a page split. Because of MVCC, even a
>>> unique index can contain duplicates.
>>> Storing duplicates in a posting list/tree helps to avoid superfluous
>>> splits.
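>>>
>>> (As a concrete illustration of the MVCC point, here is a minimal sketch with
>>> made-up names: deleting and re-inserting a key leaves two index entries for
>>> it until vacuum removes the dead one.)
>>>
>>> create extension if not exists pageinspect;
>>> create table u (id int primary key);
>>> insert into u values (1);
>>> delete from u where id = 1;
>>> insert into u values (1);
>>> -- before vacuum, the unique index u_pkey still holds the entry pointing to
>>> -- the deleted row as well as the new one, i.e. two entries for id = 1
>>> select itemoffset, ctid from bt_page_items('u_pkey', 1);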
>>>
>>>
>>> I'd like to share the progress of my work, so here is a WIP patch.
>>> It provides effective duplicate handling using posting lists, in the same
>>> way GIN does.
>>>
>>> The layout of the tuples on the page is changed in the following way:
>>> before:
>>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
>>> (ip_blkid, ip_posid) + key
>>> with patch:
>>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>>
>>> It seems that backward compatibility works well without any changes, but I
>>> haven't tested it properly yet.
>>>
>>> Here are some test results. They were obtained with the test functions
>>> test_btbuild and test_ginbuild, which you can find in the attached SQL file.
>>> i is the number of distinct values in the index, so i=1 means that all rows
>>> have the same key, and i=10000000 means that all keys are different.
>>> The other columns contain the index size in MB.
>>>
>>> i            B-tree Old    B-tree New     GIN
>>> 1            214.234375     87.7109375    10.2109375
>>> 10           214.234375     87.7109375    10.71875
>>> 100          214.234375     87.4375       15.640625
>>> 1000         214.234375     86.2578125    31.296875
>>> 10000        214.234375     78.421875    104.3046875
>>> 100000       214.234375     65.359375     49.078125
>>> 1000000      214.234375     90.140625    106.8203125
>>> 10000000     214.234375    214.234375    534.0625
>>> You can see that the last row contains the same index size for the old and
>>> new B-tree, which is quite logical: there is no compression if all the keys
>>> are distinct.
>>> The other cases look really nice to me.
>>> Note also that I haven't implemented posting list compression yet, so there
>>> is still potential to decrease the size of the compressed btree further.
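>>>
>>> (For reference, data with i distinct values over 10,000,000 rows can be
>>> generated roughly like this; the actual test_btbuild/test_ginbuild functions
>>> are in the attached file, so the names below are only illustrative, with :i
>>> set as a psql variable.)
>>>
>>> create table btree_test as
>>>     select g % :i as val from generate_series(1, 10000000) g;
>>> create index btree_test_val_idx on btree_test using btree (val);
>>> select pg_size_pretty(pg_relation_size('btree_test_val_idx'));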
>>>
>>> I'm almost sure there are still some tiny bugs and missing functions, but on
>>> the whole the patch is ready for testing.
>>> I'd like to get feedback from testing the patch on some real datasets. Any
>>> bug reports and suggestions are welcome.
>>>
>>> Here are a couple of useful queries to inspect the data inside the index
>>> pages:
>>> create extension pageinspect;
>>> select * from bt_metap('idx');
>>> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
>>> n) as bt;
>>> select n, bt.* from generate_series(1,1) as n, lateral
>>> bt_page_items('idx', n) as bt;
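>>>
>>> (These look at page 1 only. To walk every page, one can derive the page
>>> count from the relation size, since block 0 is the metapage, e.g.:
>>> select n, bt.* from generate_series(1,
>>>     pg_relation_size('idx') / current_setting('block_size')::int - 1) as n,
>>>     lateral bt_page_items('idx', n::int) as bt; )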
>>>
>>> And finally, the list of items I'm going to complete in the near future:
>>> 1. Add a storage parameter 'enable_compression' for the btree access method,
>>> which specifies whether the index handles duplicates. The default is 'off'
>>> (a sketch of the likely syntax follows this list).
>>> 2. Bring back microvacuum functionality for compressed indexes.
>>> 3. Improve insertion speed. Insertions became significantly slower with the
>>> compressed btree, which is obviously not what we want.
>>> 4. Clean up the code and comments, and add related documentation.
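>>>
>>> (Presumably, once item 1 is done, the reloption would be set with the usual
>>> WITH clause, roughly:
>>> create index idx on tbl using btree (col) with (enable_compression = on);
>>> but that is just a guess at the interface based on the description above.)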
>>>
>>
>> This doesn't apply cleanly against current git head. Have you caught up
>> past commit 65c5fcd35?
>>
>>
>> Thank you for the notice. A new patch is attached.
>>
>
> Thanks for the quick rebase.
>
> Okay, a quick check with pgbench:
>
> CREATE INDEX ON pgbench_accounts(bid);
>
> Timing
> Scale: master / patch
> 100: 10657ms / 13555ms (rechecked and got 9745ms)
> 500: 56909ms / 56985ms
>
> Size
> Scale: master / patch
> 100: 214MB / 87MB (40.7%)
> 500: 1071MB / 437MB (40.8%)
>
> No performance issues from what I can tell.
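>
> (For reference, the index sizes above can be checked with something like
> select pg_size_pretty(pg_relation_size('pgbench_accounts_bid_idx'));
> where pgbench_accounts_bid_idx is the name that CREATE INDEX generates by
> default, and the timings can be taken with psql's \timing.)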
>
> I'm surprised that efficiencies can't be realised beyond this point. Your
> results show a sweet spot at around 1000 / 10000000, with it getting
> slightly worse beyond that. I kind of expected a lot of efficiency where
> all the values are the same, but perhaps that's due to my lack of
> understanding regarding the way they're being stored.
>

Okay, now for some badness. I've restored a database containing two tables,
one 318MB and another 24kB. The 318MB table contains 5 million rows with a
sequential id column. I get a problem if I try to delete many rows from it:

# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

The query completes, but I get this message a lot before it does.

This happens even if I drop the primary key and foreign key constraints, so
somehow the memory usage has massively increased with this patch.
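
For anyone wanting to reproduce this, the shape of the data is roughly as
follows (the table below is only a stand-in for the one I restored):

create table contacts as
    select g as id, md5(g::text) as padding from generate_series(1, 5000000) g;
alter table contacts add primary key (id);
delete from contacts where id % 3 != 0;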

Thom
