LSM tree for Postgres

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: LSM tree for Postgres
Date: 2020-08-04 08:22:13
Message-ID: 315b7ce8-9d62-3817-0a92-4b20519d0c51@postgrespro.ru
Lists: pgsql-hackers

Hi hackers,

I want to share the results of my recent research on implementing an LSM
index in Postgres.
Many modern databases (RocksDB, MongoDB, Tarantool, ...) use an LSM
tree instead of the classical B-Tree.

On the one hand, the RAM capacity of modern servers makes it possible to
keep the whole database in memory.
This leads to the anti-caching approach proposed by Michael Stonebraker:
https://15721.courses.cs.cmu.edu/spring2016/papers/hstore-anticaching.pdf

On the other hand, index maintenance is one of the most important
factors limiting database performance.
Postgres can insert records into a table without indexes at almost
linear disk write speed (hundreds of megabytes per second).
But if the table contains indexes, the inserted keys have random values,
and the indexes do not fit in memory, then we observe a dramatic
degradation of performance. Average HDD access time is about 10 msec,
which corresponds to 100 random reads per second. If a table has several
indexes and is large enough not to fit in memory, then insert speed can
be as low as tens of TPS. Certainly an SSD provides much better random
access times, but random reads are still slow.
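This back-of-the-envelope limit is easy to reproduce; here is a minimal
sketch, where the seek time and the number of page reads per insert are
illustrative assumptions rather than measurements:

```python
# Rough estimate of insert throughput when every index update requires a
# random HDD read. All numbers are illustrative assumptions.
hdd_seek_s = 0.010                     # ~10 ms average HDD access time
random_reads_per_s = 1 / hdd_seek_s    # about 100 random reads per second

n_indexes = 3                          # hypothetical number of indexes on the table
reads_per_insert = n_indexes           # assume ~1 uncached page read per index

tps = random_reads_per_s / reads_per_insert
print(round(tps))                      # tens of TPS, as observed
```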

The LSM approach tries to address this problem.
First of all, I started my experiments with RocksDB (perhaps the most
popular LSM-based key-value storage, used in many databases).
There is an FDW for RocksDB from the VidarDB project:
https://github.com/vidardb/pgrocks
Since RocksDB is a multithreaded embedded database engine while Postgres
is based on a multiprocess architecture,
they used an interesting "server inside server" approach: a bgworker
process works with RocksDB, and
backends send requests to it through a shared memory queue.

I have significantly rewritten their FDW implementation: the original
RocksDB server implementation was single-threaded.
I made it multithreaded, which allows multiple RocksDB requests to run
in parallel.
My implementation can be found here:
https://github.com/postgrespro/lsm

Some benchmark results.
The benchmark simply inserts 250 million records with random keys into a
covering ("inclusive") index containing one bigint key and 8 bigint fields.
The index size is about 20 GB and the target system has 16 GB of RAM:

Index              Clients     TPS
Inclusive B-Tree      1        9387
Inclusive B-Tree     10       18761
RocksDB FDW           1      138350
RocksDB FDW          10      122369
RocksDB               1      166333
RocksDB              10      141482

As you can see, there is about a 10x difference.
Perhaps somebody will find this idea of an IOT (index-organized table)
based on RocksDB inside Postgres useful.
But this approach violates all the ACID properties of Postgres:
there is no atomicity or consistency (in principle RocksDB supports 2PC,
but it is not used here),
isolation corresponds to something like "read uncommitted",
and as for durability, it is entirely up to RocksDB, and I have serious
doubts that it will survive a failure, especially with sync write mode
disabled.
So I considered this project mostly as a prototype for estimating the
efficiency of the LSM approach.

Then I thought about implementing the LSM ideas using the standard Postgres nbtree.

We need two indexes: a small one for fast inserts and a big (main)
index. The top index is small enough to fit in memory,
so inserts into it are very fast.
Periodically we merge data from the top index into the base index and
truncate the top index. To avoid blocking inserts into the table
while we are merging indexes, we can add one more top index, which will
be used during the merge.

So the final architecture of Lsm3 is the following:
two top indexes used in a cyclic way and one main index. When the active
top index reaches some threshold size,
we initiate a merge with the main index (performed by a bgworker) and
switch inserts to the other top index.
Since merging is done in the background, it does not affect insert
speed.
Unfortunately, the Postgres Index AM has no bulk insert operation, so we
have to perform normal inserts.
But the inserted data is already sorted by key, which should improve
access locality and partly solve the random reads problem for the base
index.
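The insert path described above can be sketched as follows. This is a
toy model, not the extension's actual code: plain dicts stand in for
nbtree indexes, the threshold is artificially tiny, and the merge runs
synchronously here, whereas in Lsm3 it is performed by a bgworker:

```python
# Sketch of the Lsm3 insert path: two small "top" indexes used cyclically
# plus one big base index. Dicts stand in for nbtree indexes.
TOP_THRESHOLD = 4  # tiny threshold for illustration only

class Lsm3Sketch:
    def __init__(self):
        self.tops = [{}, {}]   # two top indexes, used in a cyclic way
        self.active = 0        # which top index currently receives inserts
        self.base = {}         # big main index

    def insert(self, key, tid):
        top = self.tops[self.active]
        top[key] = tid
        if len(top) >= TOP_THRESHOLD:
            # Switch inserts to the other top index, then merge the full
            # one into the base index and truncate it. In Lsm3 the merge
            # happens in a background worker, so inserts are not blocked.
            self.active = 1 - self.active
            self._merge(top)

    def _merge(self, top):
        # Keys come out sorted, which improves access locality in the
        # base index (the Index AM has no bulk insert, so one-by-one).
        for key in sorted(top):
            self.base[key] = top[key]
        top.clear()
```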

Certainly, to perform a search in Lsm3 we have to do lookups in all
three indexes and merge the search results
(in the case of a unique index we can avoid the extra searches if the
searched value is found in a top index).
It can happen that during the merge of the top and base indexes the same
TID is present in both of them.
But such duplicates can easily be eliminated while merging the search results.
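The lookup side can be sketched in the same toy style (dicts standing in
for nbtree scans; the function name and shapes are illustrative, not the
extension's API):

```python
# Sketch of a range search in Lsm3: scan all three indexes and merge the
# results. A TID found in both a top index and the base index (a
# mid-merge duplicate) is reported only once.
def lsm3_search(tops, base, lo, hi):
    seen_tids = set()
    results = []
    for index in (*tops, base):
        for key in sorted(index):
            if lo <= key <= hi:
                tid = index[key]
                if tid not in seen_tids:  # eliminate mid-merge duplicates
                    seen_tids.add(tid)
                    results.append((key, tid))
    return sorted(results)
```

For example, if key 1 is present in both a top index and the base index
while its merge is in flight, it is returned only once alongside the
other matches.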

Since we are using standard nbtree indexes, there is no need to worry
about logging information in WAL:
no inefficient "generic WAL records", and no patching the core with our
own WAL record types.

The implementation of the Lsm3 Index AM as a standard Postgres extension
is available here:
https://github.com/postgrespro/lsm3

I have performed the same benchmark with random inserts (described
above) for Lsm3:

Index              Clients     TPS
Inclusive B-Tree      1        9387
Inclusive B-Tree     10       18761
RocksDB FDW           1      138350
RocksDB FDW          10      122369
RocksDB               1      166333
RocksDB              10      141482
Lsm3                  1      151699
Lsm3                 10       65997

The nbtree size is about 29 GB; single-client performance is even higher
than that of the RocksDB FDW, but the parallel results are significantly
worse.
So Lsm3 can provide a significant performance improvement for large
indexes that do not fit in main memory.
And the larger the ratio between index size and RAM size, the larger the
gain in insertion speed.
Lsm3 is just a standard Postgres extension, fully integrated into the
Postgres infrastructure (MVCC, WAL, backups, ...).
So I hope it can be useful when standard indexes become a bottleneck.

I will be glad to receive any feedback, and perhaps change requests or
proposals.

Best regards,
Konstantin
