Re: LSM tree for Postgres

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LSM tree for Postgres
Date: 2020-08-09 07:26:17
Lists: pgsql-hackers

On 09.08.2020 04:53, Alexander Korotkov wrote:
>> I realize that it is not true LSM.
>> But still I wan to notice that it is able to provide ~10 times increase
>> of insert speed when size of index is comparable with RAM size.
>> And "true LSM" from RocksDB shows similar results.
> It's very far from being shown. All the things you've shown is a
> naive benchmark. I don't object that your design can work out some
> cases. And it's great that we have the lsm3 extension now. But I
> think for PostgreSQL core we should think about better design.

Sorry, I meant that on this particular benchmark and hardware Lsm3 and
RocksDB show similar performance.
It definitely doesn't mean that this will hold in all other cases.
This is one of the reasons why I have published Lsm3 and the RocksDB FDW:
anybody can try to test them on their own workload.
It would be very interesting for me to know the results, because I
certainly understand that measuring random insert performance on a dummy
table is not enough to make any conclusions.

And I certainly do not want to say that we do not need a "right" LSM
implementation inside Postgres core.
It just requires an order of magnitude more effort.
And there are many questions and challenges. For example, the Postgres
buffer size (8kB) seems to be too small for LSM.
Should an LSM implementation bypass the Postgres buffer cache? There are
pros and cons.

Another issue is logging. Should we just log all LSM operations in WAL
in the usual way (as is done for nbtree and Lsm3)?
It seems to me that alternative, more efficient solutions may be
proposed for LSM.
For example, we may not log inserts into the top index at all and just
replay them during recovery, assuming that this operation on a small
index is fast enough. And the merge of the top index with the base index
can be done atomically and so also doesn't require WAL.
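
To make the top index / base index idea above concrete, here is a toy
sketch in Python. It is not how Lsm3 is implemented: the class name
TwoLevelIndex, the top_limit parameter and the use of sorted lists in
place of B-trees are all assumptions made for illustration. It only
shows that inserts touch the small top index, and that a merge builds
the new base index off to the side and swaps it in with one step.

```python
import bisect
import heapq


class TwoLevelIndex:
    """Toy model: a small top index absorbs inserts and is periodically
    merged into a large base index. Sorted lists stand in for B-trees."""

    def __init__(self, top_limit=4):
        self.top_limit = top_limit  # hypothetical threshold that triggers a merge
        self.top = []               # small index, assumed to fit in memory
        self.base = []              # large index
        self.merges = 0

    def insert(self, key):
        # Inserts only ever touch the small top index.
        bisect.insort(self.top, key)
        if len(self.top) >= self.top_limit:
            self.merge()

    def merge(self):
        # Build the merged index separately, then replace the old one
        # with a single assignment (the "atomic" step in this toy model).
        merged = list(heapq.merge(self.base, self.top))
        self.base, self.top = merged, []
        self.merges += 1

    def contains(self, key):
        # A lookup has to check both levels.
        for level in (self.top, self.base):
            pos = bisect.bisect_left(level, key)
            if pos < len(level) and level[pos] == key:
                return True
        return False
```

In this model a crash before the swap loses only the contents of the
top index, which is the part that would have to be rebuilt during
recovery.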

As far as I know, Anastasia Lubennikova implemented LSM for Postgres
several years ago.
There were some performance issues (with concurrent access?).
This is why the first thing I want to clarify for myself is what the
bottlenecks of the LSM architecture are,
and whether they are caused by LSM itself or by its integration in Postgres.

In any case, before thinking about the details of an in-core LSM
implementation for Postgres, I think that
it is necessary to demonstrate workloads on which RocksDB (or any other
existing DBMS with LSM)
shows significant performance advantages compared with Postgres with

>> May be if size of
>> index will be 100 times larger then
>> size of RAM, RocksDB will be significantly faster than Lsm3. But modern
>> servers has 0.5-1Tb of RAM.
>> Can't believe that there are databases with 100Tb indexes.
> Comparison of whole RAM size to single index size looks plain wrong
> for me. I think we can roughly compare whole RAM size to whole
> database size. But also not the whole RAM size is always available
> for caching data. Let's assume half of RAM is used for caching data.
> So, a modern server with 0.5-1Tb of RAM, which suffers from random
> B-tree insertions and badly needs LSM-like data-structure, runs a
> database of 25-50Tb. Frankly speaking, there is nothing
> counterintuitive for me.

There is indeed nothing counterintuitive.
I just mean that there are not so many 25-50Tb OLTP databases.
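
For reference, the 25-50Tb figure in the quoted estimate can be
reproduced with a few lines of Python. The function name and parameter
names are made up for this sketch; the inputs (0.5-1Tb of RAM, half of
it used for caching, data 100 times larger than the cache) are the
numbers from the quoted text.

```python
def database_size_tb(ram_tb, cache_fraction=0.5, data_to_cache_ratio=100):
    """Database size (in Tb) at which the data is data_to_cache_ratio
    times larger than the part of RAM available for caching."""
    return ram_tb * cache_fraction * data_to_cache_ratio


low = database_size_tb(0.5)   # server with 0.5Tb of RAM
high = database_size_tb(1.0)  # server with 1Tb of RAM
print(low, high)
```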
