Re: LSM tree for Postgres

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LSM tree for Postgres
Date: 2020-08-04 15:11:36
Message-ID: 20200804151136.3utkewh4ydcngiui@development
Lists: pgsql-hackers

On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>Hi hackers,
>I want to share the results of my recent research on implementing an
>LSM index in Postgres.
>Most modern databases (RocksDB, MongoDB, Tarantool, ...) use an LSM
>tree instead of a classical B-Tree.

I was under the impression that LSM is more an alternative primary
storage, not for indexes. Or am I wrong / confused?

>From one side, the RAM capacity of modern servers allows keeping the
>whole database in memory.
>This leads to the anti-caching approach proposed by Michael Stonebraker.
>From the other side, maintaining indexes is one of the most important
>factors limiting database performance.
>Postgres is able to insert records into a table without indexes at
>almost sequential disk write speed (hundreds of megabytes per second).
>But if the table has indexes, the inserted keys are random, and the
>indexes don't fit in memory, then we observe a dramatic degradation of
>performance. Average HDD access time is about 10ms, which corresponds
>to 100 random reads per second. If the table has several indexes and
>is large enough not to fit in memory, then insert speed can be as low
>as tens of TPS. Certainly an SSD can provide much better random access,
>but random reads are still slow.

True. Indexes (the way we do them) almost inevitably cause random I/O.
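The quoted figures can be checked with a quick back-of-envelope calculation; the per-index read count below is an assumption of mine, just to show how "tens of TPS" falls out of 10ms seeks:

```python
# Back-of-envelope check of the figures quoted above.
seek_ms = 10                       # average HDD access time (from the post)
reads_per_sec = 1000 / seek_ms     # -> 100 random reads per second
n_indexes = 3                      # "several" indexes (assumed)
reads_per_insert = 3               # random page reads per index insert (assumed)
tps = reads_per_sec / (n_indexes * reads_per_insert)
print(f"{reads_per_sec:.0f} random reads/s -> ~{tps:.0f} inserts/s")
```

With three cold indexes and three random reads each, an insert costs nine seeks, i.e. roughly 11 TPS on a 10ms-seek disk.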

>LSM approach tries to address this problem.
>First of all I started my experiments with RocksDB (perhaps the most
>popular LSM-based key-value storage, used in many databases).
>There was an FDW for RocksDB from the VidarDB project:
>Since RocksDB is a multithreaded embedded database engine and Postgres
>is based on a multiprocess architecture,
>they used an interesting "server inside server" approach: there is a
>bgworker process which works with RocksDB, and
>backends send requests to it through a shared-memory queue.
>I have significantly rewritten their FDW implementation: the original
>RocksDB server implementation was single-threaded.
>I made it multithreaded, making it possible to run multiple RocksDB
>requests in parallel.
>My implementation can be found here:
>Some benchmark results.
>The benchmark is simply insertion of 250 million records with random
>keys into an inclusive index containing one bigint key and 8 bigint
>fields. The index size is about 20GB and the target system has 16GB
>of RAM:
>Index            Clients TPS
>Inclusive B-Tree 1       9387
>Inclusive B-Tree 10      18761
>RocksDB FDW      1       138350
>RocksDB FDW      10      122369
>RocksDB          1       166333
>RocksDB          10      141482
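The "server inside server" scheme described in the quoted text can be sketched as a toy model. Here threads and a Python queue stand in for backend processes and the shared-memory queue, and a dict stands in for the embedded RocksDB instance; none of this is the actual FDW code:

```python
import threading, queue

def kv_server(requests):
    """One worker owning the store, like the bgworker owning RocksDB."""
    store = {}                          # stand-in for the embedded RocksDB
    while True:
        op, key, value, reply = requests.get()
        if op == "stop":
            break
        if op == "put":
            store[key] = value
            reply.put("ok")
        elif op == "get":
            reply.put(store.get(key))

requests = queue.Queue()                # shared request queue
server = threading.Thread(target=kv_server, args=(requests,))
server.start()

def call(op, key, value=None):
    """What a backend does: enqueue a request, wait for the reply."""
    reply = queue.Queue(maxsize=1)      # per-backend response channel
    requests.put((op, key, value, reply))
    return reply.get()

r1 = call("put", "k1", "v1")            # "ok"
r2 = call("get", "k1")                  # "v1"
requests.put(("stop", None, None, None))
server.join()
print(r1, r2)
```

The single-threaded version of this is exactly the bottleneck the post mentions: one server loop serializes all backends, which is why making the server side multithreaded matters.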

Interesting, although those are just writes, right? Do you have any
numbers for reads? Also, what are the numbers when you start with "larger
than RAM" data (i.e. ignoring the initial period when the index fits
into memory)?

>As you can see, there is about a 10x difference.
>Maybe somebody will find this idea of an IOT (index-organized table)
>based on RocksDB in Postgres useful.
>But this approach violates all ACID properties of Postgres:
>there is no atomicity and consistency (in principle RocksDB supports
>2PC, but it is not used here),
>isolation corresponds to something like "read uncommitted",
>and concerning durability, it is all up to RocksDB, and I have serious
>doubts that it would survive a failure, especially with sync write
>mode disabled.
>So I considered this project mostly as a prototype for estimating the
>efficiency of the LSM approach.

Yeah, I think in general FDW to a database with different consistency
model is not going to get us very far ... Good for PoC experiments, but
not really designed for stuff like this. Also, in my experience FDW has
significant per-row overhead.

>Then I thought about implementing the ideas of LSM using the standard
>Postgres nbtree.
>We need two indexes: one small index for fast inserts and another big
>(main) index. The top index is small enough to fit in memory,
>so inserts into this index are very fast.
>Periodically we merge data from the top index into the base index and
>truncate the top index. To prevent blocking of inserts into the table
>while we are merging indexes, we can add one more index, which will be
>used during the merge.
>So the final architecture of Lsm3 is the following:
>two top indexes used in a cyclic way and one main index. When a top
>index reaches some threshold value,
>we initiate a merge with the main index, done by a bgworker, and
>switch to the other top index.
>Since merging of indexes is done in the background, it doesn't affect
>insert speed.
>Unfortunately the Postgres Index AM has no bulk-insert operation, so
>we have to perform normal inserts.
>But the inserted data is already sorted by key, which should improve
>access locality and partly solve the random-read problem for the base
>index.
>Certainly, to perform a search in Lsm3 we have to do lookups in all
>three indexes and merge the search results
>(in the case of a unique index we can avoid extra searches if the
>searched value is found in a top index).
>It can happen that during the merge of the top and base indexes the
>same TID is found in both of them,
>but such duplicates can easily be eliminated while merging the search
>results.
>Since we are using standard nbtree indexes, there is no need to worry
>about logging information in WAL.
>There is no need to use inefficient "generic WAL records" or to patch
>the kernel by adding our own WAL records.

Makes sense, I guess. I always imagined we could do something like this
by adding "buffers" into the btree directly, and instead of pushing them
all the way down to the leaf pages we'd only insert them into the first
buffer (and full buffers would get "propagated" in the background).

How many such "buffers" we'd need / in which places in the btree is an
open question - I suppose we could have a buffer for every internal page.
Typical indexes have ~1% in internal pages, so even if each 8kB internal
page has an associated 8kB buffer page, it's not going to increase the
size significantly. Of course, it's going to make lookups more expensive
because you have to search all the buffers on the way to the leaf page
(I wonder if we could improve this by keeping a tiny bloom filter for
those buffers, representing the data in the subtree).
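That bloom-filter idea could look roughly like this. It's a sketch only: the bit count, hash count, and use of blake2b are arbitrary choices, not a tuned design:

```python
import hashlib

class BufferBloom:
    """Tiny bloom filter over the keys buffered in one internal page,
    so most lookups can skip the buffer without searching it."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.data = 0                  # bitset packed into one int

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.data |= 1 << p

    def might_contain(self, key):
        # False means "definitely not buffered"; True may be a false positive.
        return all(self.data >> p & 1 for p in self._positions(key))

buf_filter = BufferBloom()
for key in (10, 20, 30):               # keys buffered in this internal page
    buf_filter.add(key)
print(buf_filter.might_contain(20), buf_filter.might_contain(99))
```

A descent to a leaf would check the filter at each internal page and only search the attached buffer page on a positive, which keeps the common case close to a plain btree lookup.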

Not sure if this would be simpler/cheaper than maintaining multiple
separate indexes, which is what you propose.

BTW how would your approach work with unique indexes, speculative
inserts etc.?

>Implementation of the Lsm3 Index AM as a standard Postgres extension
>is available here:
>I have performed the same benchmark with random inserts (described
>above) for Lsm3:
>Index            Clients TPS
>Inclusive B-Tree 1       9387
>Inclusive B-Tree 10      18761
>RocksDB FDW      1       138350
>RocksDB FDW      10      122369
>RocksDB          1       166333
>RocksDB          10      141482
>Lsm3             1       151699
>Lsm3             10      65997
>Size of nbtree is about 29Gb, single client performance is even higher
>than of RocksDB FDW, but parallel results are signficantly worser.
>So Lsm3 can provide significant improve of performance for large
>indexes not fitting in main memory.
>And the larger ratio between index size and RAM size is, the larger
>benefit in insertion speed you get.
>Lsm3 is just standard postgres extension, fully integrated in Postgres
>infrastructure (MVCC, WAL, backups,...).
>SO I hope it can be useful when standard indexes becomes bottleneck.

Isn't it a bit suspicious that with more clients the throughput actually
drops significantly? Is this merely due to PoC stage, or is there some
inherent concurrency bottleneck?


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
