Re: LSM tree for Postgres

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LSM tree for Postgres
Date: 2020-08-04 15:04:40
Message-ID: CAPpHfdtBSr1=+3ZRTbVzTJocp8qcPnfgjAVH2-rJMWyLaSmjxw@mail.gmail.com
Lists: pgsql-hackers

Hi!

On Tue, Aug 4, 2020 at 11:22 AM Konstantin Knizhnik
<k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> I want to share the results of my recent research on implementing an LSM index in Postgres.
> Most modern databases (RocksDB, MongoDB, Tarantool, ...) use an LSM tree instead of the classical B-tree.

I wouldn't say it that way. I would say they provide LSM in
addition to the B-tree. For instance, WiredTiger (which is the main
engine for MongoDB) provides both B-tree and LSM. As far as I know,
Tarantool provides at least an in-memory B-tree. If RocksDB is used
as an engine for MySQL, then it's also an addition to the B-tree
provided by InnoDB. Also, the B-tree implementations in the
mentioned DBMSes are very different. I would say none of them is
purely classical.

> The LSM approach tries to address this problem.

LSM has great use-cases for sure.

> I have significantly rewritten their FDW implementation: the original RocksDB server implementation was single-threaded.
> I have made it multithreaded, making it possible to run multiple RocksDB requests in parallel.
> My implementation can be found there:
> https://github.com/postgrespro/lsm

Great, thank you for your work.

> Some results of benchmarking.
> The benchmark is just the insertion of 250 million records with random keys into an inclusive index containing one bigint key and 8 bigint fields.
> The size of the index is about 20GB and the target system has 16GB of RAM:

What storage do you use?

> As you can see, there is about a 10x difference.
> Maybe somebody will find this idea of using an IOT (index-organized table) based on RocksDB in Postgres useful.
> But this approach violates all ACID properties of Postgres:
> there is no atomicity and consistency (in principle RocksDB supports 2PC, but it is not used here),
> isolation corresponds to something like "read uncommitted",
> and concerning durability - it is all up to RocksDB, and I have serious doubts that it will survive a failure, especially with sync write mode disabled.
> So I considered this project mostly as a prototype for estimating the efficiency of the LSM approach.

Yes, integration of WAL and snapshots between Postgres and RocksDB is
problematic. I also doubt that RocksDB can use the full power of the
Postgres extensible type system.

> Then I thought about implementing the ideas of LSM using the standard Postgres nbtree.
>
> We need two indexes: one small one for fast inserts and another big (main) index. The top index is small enough to fit in memory,
> so inserts into this index are very fast.
> Periodically we will merge data from the top index into the base index and truncate the top index. To prevent blocking of inserts into the table
> while we are merging indexes, we can add one more index, which will be used during the merge.
>
> So the final architecture of Lsm3 is the following:
> two top indexes used in a cyclic way and one main index. When a top index reaches some threshold value,
> we initiate a merge with the main index, done by a bgworker, and switch to the other top index.
> Since merging of indexes is done in the background, it doesn't affect insert speed.
> Unfortunately the Postgres Index AM has no bulk insert operation, so we have to perform normal inserts.
> But the inserted data is already sorted by key, which should improve access locality and partly solve the random-reads problem for the base index.
>
> Certainly, to perform a search in Lsm3 we have to make lookups in all three indexes and merge the search results
> (in case of a unique index we can avoid extra searches if the searched value is found in a top index).
> It can happen that during a merge of the top and base indexes the same TID can be found in both of them.
> But such duplicates can be easily eliminated during the merge of search results.
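
To make sure I read the scheme right, here is a toy, single-threaded C
model of it. All names and data structures are invented for
illustration; this is not the actual Lsm3 code, and the trees are
modeled as plain sorted arrays of (key, TID) pairs. It shows the two
top trees used cyclically, the threshold-triggered merge into the base
tree, and a search that looks into all three trees and eliminates
duplicate TIDs:

/* Toy model of the described Lsm3 scheme -- NOT the extension code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOP_THRESHOLD 4     /* tiny threshold, just for the demo */
#define MAX_ENTRIES   64

typedef struct { long key; long tid; } Entry;

typedef struct {
    Entry entries[MAX_ENTRIES];
    int   count;
} Tree;

static Tree top[2];          /* two top indexes used in a cyclic way */
static Tree base;            /* the big main index */
static int  active = 0;      /* which top index receives inserts */

static int cmp_entry(const void *a, const void *b)
{
    const Entry *x = a, *y = b;
    if (x->key != y->key) return x->key < y->key ? -1 : 1;
    if (x->tid != y->tid) return x->tid < y->tid ? -1 : 1;
    return 0;
}

/* Merge the full top tree into the base tree, then truncate it. */
static void merge_top_into_base(Tree *t)
{
    memcpy(base.entries + base.count, t->entries, t->count * sizeof(Entry));
    base.count += t->count;
    qsort(base.entries, base.count, sizeof(Entry), cmp_entry);
    t->count = 0;            /* "truncate" the top index */
}

static void lsm3_insert(long key, long tid)
{
    Tree *t = &top[active];
    t->entries[t->count].key = key;
    t->entries[t->count].tid = tid;
    t->count++;

    if (t->count >= TOP_THRESHOLD)
    {
        /* switch inserts to the other top index ... */
        active = 1 - active;
        /* ... while the full one is merged; the real extension does
         * this in a bgworker, the toy does it synchronously. */
        merge_top_into_base(t);
    }
}

/* Search looks into all three trees and merges the results,
 * eliminating TIDs that could appear both in a top tree and in the
 * base while a merge is in progress. */
static void lsm3_search(long key)
{
    long seen[3 * MAX_ENTRIES];
    int  nseen = 0;
    Tree *trees[3] = { &top[0], &top[1], &base };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < trees[i]->count; j++)
            if (trees[i]->entries[j].key == key)
            {
                long tid = trees[i]->entries[j].tid;
                int  dup = 0;
                for (int k = 0; k < nseen; k++)
                    if (seen[k] == tid) { dup = 1; break; }
                if (!dup)
                {
                    printf("key=%ld tid=%ld\n", key, tid);
                    seen[nseen++] = tid;
                }
            }
}

int main(void)
{
    for (long i = 0; i < 10; i++)
        lsm3_insert(rand() % 5, i);   /* random keys, unique TIDs */
    lsm3_search(3);
    return 0;
}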

You use a fixed number of trees. Is this a limitation of the prototype
or an intentional design decision? I guess if the base index is orders
of magnitude bigger than RAM, this scheme can degrade greatly.

> Since we are using standard nbtree indexes, there is no need to worry about logging information in WAL.
> There is no need to use inefficient "generic WAL records" or to patch the core by adding our own WAL records.

As I understand it, the merge operations are logged in the same way as
ordinary inserts. This seems acceptable, but a single insert operation
would eventually cause several times more WAL than it does in a
B-tree: each tuple is re-inserted, and therefore WAL-logged again, at
every merge into a lower tree (especially if we'll have an
implementation with a flexible number of trees). In principle, that
could be avoided if recovery could replay the tree-merge logic on its
side. But that approach can hardly fit Postgres in many ways.

> I have performed the same benchmark with random inserts (described above) for Lsm3:
>
> Index              Clients     TPS
> Inclusive B-Tree         1    9387
> Inclusive B-Tree        10   18761
> RocksDB FDW              1  138350
> RocksDB FDW             10  122369
> RocksDB                  1  166333
> RocksDB                 10  141482
> Lsm3                     1  151699
> Lsm3                    10   65997
>
> The size of the nbtree is about 29GB; single-client performance is even higher than that of the RocksDB FDW, but the parallel results are significantly worse.

Did you investigate the source of the degradation? Such degradation
doesn't look inevitable to me. Probably, things could be improved.

> I will be glad to receive any feedback, may be some change requests or proposals.

As I understand it, you benchmarked just inserts. But what about
vacuum? I think the way Postgres vacuum works now isn't optimal for
an lsm. Postgres vacuum requires a full scan of the index, because it
provides a bitmap of TIDs to be deleted without information about the
index keys. For an lsm, it would be better if vacuum pushed delete
requests to the top level of the lsm (as specially marked tuples or
something). Thanks to that, index deletes could be as efficient as
inserts. This is especially important for an lsm with many levels
and/or aggressive vacuum.
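
To make the idea concrete, here is a hedged toy sketch in the same
spirit as the model above (invented names, not a patch and not
Postgres internals): vacuum inserts a specially marked "tombstone"
entry into the top tree instead of scanning the whole index, and the
merge step drops a base entry when it meets a matching tombstone, so
a delete costs about the same as an insert.

/* Toy sketch of the suggested vacuum strategy -- hypothetical code. */
#include <stdio.h>

#define MAX 16
typedef struct { long key; long tid; int tombstone; } Entry;
typedef struct { Entry e[MAX]; int n; } Tree;

static void insert(Tree *t, long key, long tid, int tombstone)
{
    t->e[t->n++] = (Entry){ key, tid, tombstone };
}

/* Merge top into base: live entries are copied, base entries matched
 * by a tombstone are dropped, and tombstones themselves are consumed,
 * so no full index scan is ever needed. */
static void merge(Tree *top, Tree *base)
{
    Tree out = { .n = 0 };
    for (int i = 0; i < base->n; i++)
    {
        int dead = 0;
        for (int j = 0; j < top->n; j++)
            if (top->e[j].tombstone &&
                top->e[j].key == base->e[i].key &&
                top->e[j].tid == base->e[i].tid)
                dead = 1;
        if (!dead)
            out.e[out.n++] = base->e[i];
    }
    for (int j = 0; j < top->n; j++)      /* keep live top entries */
        if (!top->e[j].tombstone)
            out.e[out.n++] = top->e[j];
    *base = out;
    top->n = 0;
}

int main(void)
{
    Tree top = { .n = 0 }, base = { .n = 0 };
    insert(&base, 42, 1, 0);
    insert(&base, 43, 2, 0);
    insert(&top,  42, 1, 1);   /* vacuum pushes a delete for (42,1) */
    merge(&top, &base);
    for (int i = 0; i < base.n; i++)      /* prints only key=43 tid=2 */
        printf("key=%ld tid=%ld\n", base.e[i].key, base.e[i].tid);
    return 0;
}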

------
Regards,
Alexander Korotkov
