Re: LSM tree for Postgres

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LSM tree for Postgres
Date: 2020-08-04 16:56:55
Message-ID: 5072d21a-587d-79b3-6b7b-e630cad2a778@postgrespro.ru
Lists: pgsql-hackers

On 04.08.2020 18:04, Alexander Korotkov wrote:
> Hi!
>
> On Tue, Aug 4, 2020 at 11:22 AM Konstantin Knizhnik
> <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>> I want to share the results of my latest research on implementing an LSM index in Postgres.
>> Most modern databases (RocksDB, MongoDB, Tarantool, ...) use an LSM tree instead of the classical B-Tree.
> I wouldn't say it that way. I would say they are providing LSM in
> addition to the B-tree. For instance, WiredTiger (which is the main
> engine for MongoDB) provides both B-tree and LSM. As far as I know,
> Tarantool provides at least an in-memory B-tree. If RocksDB is used
> as an engine for MySQL, then it's also an addition to the B-tree
> provided by InnoDB. Also, the implementations of B-trees in the
> mentioned DBMSes are very different. I would say none of them is
> purely classical.
I am not suggesting that we completely replace B-Tree with LSM.
My experiments show that Postgres nbtree is faster than RocksDB when
the index is small and fits in memory.
I suggest using LSM only for huge tables whose indexes are much
larger than the available memory.

>> Some results of benchmarking.
>> The benchmark is just insertion of 250 million records with a random key into an inclusive index containing one bigint key and 8 bigint fields.
>> Size of the index is about 20Gb and the target system has 16GB of RAM:
> What storage do you use?

It is my notebook with 16GB of RAM and an SSD.
Certainly it should be tested on more serious hardware.
But there is a "problem": powerful servers now have hundreds of
gigabytes of memory.
To let an LSM index demonstrate its advantages, we need to create an
index that does not fit in memory.
And the CPU speed of very expensive servers is not significantly
higher than that of my notebook.
Performing many inserts in parallel will also not significantly speed
up populating the table with data: multiple bottlenecks in Postgres
prevent linear scalability even when we have hundreds of CPU cores.
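To make the workload easier to reproduce on such hardware, here is a
minimal sketch of the insert benchmark in C with libpq (the table and
index DDL are my guesses from the description above, not the exact
benchmark code; a real run would batch inserts or use COPY):

/*
 * Sketch: 250 million rows with a random bigint key plus 8 bigint
 * payload columns, covered by an inclusive index.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");
    long    i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn,
        "CREATE TABLE t(k bigint, v1 bigint, v2 bigint, v3 bigint,"
        " v4 bigint, v5 bigint, v6 bigint, v7 bigint, v8 bigint)"));
    PQclear(PQexec(conn,
        "CREATE INDEX t_idx ON t(k)"
        " INCLUDE (v1, v2, v3, v4, v5, v6, v7, v8)"));

    for (i = 0; i < 250000000L; i++)
    {
        char sql[256];
        long k = ((long) random() << 31) ^ random(); /* random key */

        snprintf(sql, sizeof(sql),
                 "INSERT INTO t VALUES"
                 " (%ld, %ld, %ld, %ld, %ld, %ld, %ld, %ld, %ld)",
                 k, i, i, i, i, i, i, i, i);
        PQclear(PQexec(conn, sql));
    }

    PQfinish(conn);
    return 0;
}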

> Yes, integration of WAL and snapshots between Postgres and RocksDB is
> problematic. I also doubt that RocksDB can use the full power of the
> Postgres extensible type system.
This implementation supports only basic scalar Postgres types.
Actually RocksDB deals only with string key-value pairs,
so we need to serialize each Postgres type into an array of bytes
(and preserve the right ordering)!
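For example, for bigint keys an order-preserving encoding can look
like this (a minimal sketch, assuming RocksDB's default bytewise
memcmp comparator):

/*
 * Store the int64 big-endian with the sign bit flipped: byte-wise
 * comparison of the result then matches signed integer comparison.
 */
#include <stdint.h>

static void
encode_int64_key(int64_t value, uint8_t out[8])
{
    uint64_t u = (uint64_t) value ^ (UINT64_C(1) << 63);

    for (int i = 7; i >= 0; i--)
    {
        out[i] = (uint8_t) (u & 0xFF);
        u >>= 8;
    }
}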

>
> You use a fixed number of trees. Is this a limitation of the
> prototype or an intention of the design? I guess if the base index is
> orders of magnitude bigger than RAM, this scheme can degrade greatly.
I do not understand why we need multiple indexes.
We need one "hot" index which fits in memory to perform fast inserts.
But it should not be too small: it has to accumulate a substantial
number of records to make bulk inserts efficient.
I expect that the top index can be efficiently merged with the base
index because of better access locality,
i.e. we will insert multiple entries into one B-Tree page and so
minimize the slowdown caused by random reads.

A third index is needed to perform the merge in parallel with inserts
(while a merge is in progress the top index being merged is blocked
and we cannot insert into it).
I do not see the benefit of performing more than one merge in
parallel: it would only increase the fraction of random reads.
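To illustrate the scheme, a hypothetical sketch (the types and
helpers here are stand-ins, not the actual Lsm3 code):

#include <stddef.h>
#include <stdbool.h>

typedef struct BTree BTree;          /* opaque nbtree handle (stand-in) */
typedef struct Tuple Tuple;          /* index tuple (stand-in) */

extern void   bt_insert(BTree *bt, Tuple *tup);
extern size_t bt_size(BTree *bt);    /* current index size in bytes */
extern void   start_merge(BTree *from, BTree *into);  /* background */
extern size_t top_index_size_threshold;

typedef struct LsmIndex
{
    BTree *top[2];   /* two small top indexes that fit in memory */
    BTree *base;     /* the big on-disk base index */
    int    active;   /* top index currently receiving inserts */
    bool   merging;  /* is a background merge in progress? */
} LsmIndex;

void
lsm_insert(LsmIndex *lsm, Tuple *tup)
{
    bt_insert(lsm->top[lsm->active], tup);

    if (!lsm->merging &&
        bt_size(lsm->top[lsm->active]) > top_index_size_threshold)
    {
        /* Freeze the filled top index, redirect inserts to the other
         * one, and let a background worker merge the frozen index
         * into the base; after the merge it is truncated and reused. */
        lsm->active = 1 - lsm->active;
        lsm->merging = true;
        start_merge(lsm->top[1 - lsm->active], lsm->base);
    }
}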

Degradation certainly takes place, but it is not as critical as in
the case of a standard nbtree.
It is possible to tune the threshold on the top index size to make
merges most efficient.
But we cannot truncate and swap an index before the merge completes.
So if a merge operation takes a long time, the top index keeps
growing and no longer fits in memory.
That leads to further slowdown (so we have a negative feedback loop
here).

Certainly it is possible to create more top indexes, keeping each of
them small enough to fit in memory.
But in that case a search would require merging not 3, but N indexes.
I think that may cause an unacceptable slowdown of search operations,
which is highly undesirable, taking into account that most
applications send more select-only queries than updates.
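Continuing the sketch above, the reason is simply that every lookup
has to probe each index, so a point search costs N + 1 probes instead
of 3 (Key and bt_search are again stand-ins):

typedef long Key;                    /* stand-in for the key datum */
extern Tuple *bt_search(BTree *bt, Key key);

Tuple *
lsm_search(LsmIndex *lsm, Key key)
{
    Tuple *result;
    int    i;

    /* Newer entries live in the top indexes, so probe them first */
    for (i = 0; i < 2; i++)
        if ((result = bt_search(lsm->top[i], key)) != NULL)
            return result;

    return bt_search(lsm->base, key);
}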

>
>> As far as we are using standard nbtree indexes, there is no need to worry about logging information in WAL.
>> There is no need to use inefficient "generic WAL records" or to patch the kernel by adding our own WAL records.
> As I understand, the merge operations are logged in the same way as
> ordinary inserts. This seems acceptable, but a single insert operation
> would eventually cause several times more WAL than it does in B-tree
> (especially if we'll have an implementation with a flexible number of
> trees). In principle that could be avoided if recovery could replay
> the logic of the tree merge on its side. But this approach hardly
> fits Postgres in many ways.

Yes, this approach doubles the amount of logged data:
we need to write WAL for inserts into both the top and the base
indexes.
And it will certainly have a negative influence on performance.
Unfortunately I have no idea how to avoid it without patching the
Postgres core.
>
>> I have performed the same benchmark with random inserts (described above) for Lsm3:
>>
>> Index              Clients      TPS
>> Inclusive B-Tree      1         9387
>> Inclusive B-Tree     10        18761
>> RocksDB FDW           1       138350
>> RocksDB FDW          10       122369
>> RocksDB               1       166333
>> RocksDB              10       141482
>> Lsm3                  1       151699
>> Lsm3                 10        65997
>>
>> Size of the nbtree is about 29Gb; single-client performance is even higher than that of RocksDB FDW, but the parallel results are significantly worse.
> Did you investigate the source of the degradation? Such degradation
> doesn't yet look inevitable to me. Probably, things could be
> improved.

As I explained above, insertion in Lsm3 is currently a process with
negative feedback.
If the insertion rate is higher than the merge speed, the top index
bloats and no longer fits in memory, which in turn slows down the
merge further and, as a result, increases the top index size even
more. Several parallel clients performing inserts can fill the top
index faster than a single client does, and faster than it can be
merged with the main index.

I have tried several approaches which throttle inserts to prevent
undesired growth of the top index
(I have considered two criteria: the size of the top index and the
duration of the merge operation).
But none of my attempts were successful: they led to even worse
performance.
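Roughly, one of those attempts looked like this (a hypothetical
sketch, not the actual patch):

/* Back off while the active top index is over the threshold and a
 * merge is still running; pg_usleep() is the standard Postgres
 * sleep primitive. In practice this made performance even worse. */
void
lsm_insert_throttled(LsmIndex *lsm, Tuple *tup)
{
    while (lsm->merging &&
           bt_size(lsm->top[lsm->active]) > top_index_size_threshold)
        pg_usleep(1000L);   /* 1 ms back-off */

    lsm_insert(lsm, tup);
}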

>
>> I will be glad to receive any feedback, maybe some change requests or proposals.
> As I understand, you benchmarked just inserts. But what about
> vacuum? I think the way Postgres vacuum works now isn't optimal for
> an lsm. Postgres vacuum requires a full scan of the index, because
> it provides a bitmap of tids to be deleted without information about
> index keys. For an lsm, it would be better if vacuum would push
> delete requests to the top level of the lsm (as specially marked
> tuples or something). Thanks to that, index deletes could be as
> efficient as inserts. This is especially important for an lsm with
> many levels and/or aggressive vacuum.

Right now vacuum processes Lsm3 indexes in the usual way.
Removing records from the top indexes may not be needed at all (just
because these indexes will be truncated in any case).
But as far as the top index is expected to be small enough, vacuuming
it should not take long,
so I didn't try to avoid it (although doing so should not be
difficult: just disable ambulkdelete for the corresponding nbtree
wrappers).
Concerning deletes from the main index, I do not understand how they
can be optimized.
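Such a wrapper could look roughly like this (a sketch against the
standard index AM API; is_top_index() is an illustrative helper, not
an existing function):

#include "postgres.h"
#include "access/genam.h"
#include "access/nbtree.h"

extern bool is_top_index(Relation index);   /* hypothetical helper */

IndexBulkDeleteResult *
lsm3_bulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
                IndexBulkDeleteCallback callback, void *callback_state)
{
    /* Top indexes are truncated after each merge anyway, so dead
     * tuples in them can simply be left alone; the base index is
     * vacuumed as a normal nbtree. */
    if (is_top_index(info->index))
        return stats;

    return btbulkdelete(info, stats, callback, callback_state);
}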
