zheap: a new storage format for PostgreSQL

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: zheap: a new storage format for PostgreSQL
Date: 2018-03-01 14:09:04
Message-ID: CAA4eK1+YtM5vxzSM2NZm+pC37MCwyvtkmJrO_yRBQeZDp9Wa2w@mail.gmail.com
Lists: pgsql-hackers

Some time back, Robert proposed a solution to reduce the bloat in
PostgreSQL [1], which has some other advantages of its own as well. To
recap: in the existing heap, we always create a new version of a tuple on
an update, which must eventually be removed by periodic vacuuming or by
HOT-pruning, but in many cases the space is never reclaimed completely.
A similar problem occurs for tuples that are deleted. This leads to bloat
in the database.
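
To make this concrete, here is a minimal psql sketch against an ordinary
heap table (the table name and row count are made up for illustration):

# On the existing heap, an UPDATE writes a new tuple version and leaves the
# old one behind as a dead tuple.
psql -c "CREATE TABLE bloat_demo (id int PRIMARY KEY, val text);"
psql -c "INSERT INTO bloat_demo SELECT g, 'x' FROM generate_series(1, 100000) g;"
psql -c "SELECT pg_size_pretty(pg_relation_size('bloat_demo'));"
psql -c "UPDATE bloat_demo SET val = 'y';"    # every row gets a new version
psql -c "SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'bloat_demo';"
psql -c "SELECT pg_size_pretty(pg_relation_size('bloat_demo'));"
# The relation has roughly doubled in size (the dead-tuple count may take a
# moment to show up in the statistics). VACUUM makes the dead space reusable,
# but the file itself normally does not shrink back.
psql -c "VACUUM bloat_demo;"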

At EnterpriseDB, we (me and some of my colleagues) have been working for
more than a year on a new storage format in which only the latest version
of the data is kept in the main storage and the old versions are moved to
an undo log. We call this new storage format "zheap". To be clear, this
proposal is for PG-12. The purpose of posting it at this stage is that it
can serve as an example to be integrated with the pluggable storage API
patch and that we can get some early feedback on the design. The purpose of
this email is to introduce the overall project; however, I think going
forward we need to discuss some of the subsystems (like Indexing, Tuple
locking, Vacuum for non-delete-marked indexes, Undo Log Storage, Undo
Workers, etc.) in separate threads.

The three main advantages of this new format are:
1. Provide better control over bloat (a) by allowing in-place updates in
common cases and (b) by reusing space as soon as a transaction that has
performed a delete or non-in-place-update has committed. In short, with
this new storage, whenever possible, we’ll avoid creating bloat in the
first place.

2. Reduce write amplification both by avoiding rewrites of heap pages (for
setting hint-bits, freezing, etc.) and by making it possible to do an
update that touches indexed columns without updating every index.

3. Reduce the tuple size by (a) shrinking the tuple header and (b)
eliminating most alignment padding.

You can check README.md in the project repository [1] to understand how to
use it and what the open issues are. The detailed design of the project is
described in src/backend/access/zheap/README. The code for this project is
being developed in a GitHub repository [1]. You can also read about this
project in Robert's recent blog post [2]. I have also added a few notes on
integration with the pluggable storage API on the zheap wiki page [3].
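
If you want to try it out, the branch builds like any other PostgreSQL
source tree. A rough sketch follows; the install prefix and data directory
are only examples, and the way to actually create zheap tables is described
in README.md, so I won't repeat it here:

# Build and start a server from the zheap development branch; the usual
# PostgreSQL build procedure applies.
git clone https://github.com/EnterpriseDB/zheap.git
cd zheap
./configure --prefix=$HOME/pg-zheap --enable-debug
make -j4 && make install
$HOME/pg-zheap/bin/initdb -D $HOME/pg-zheap/data
$HOME/pg-zheap/bin/pg_ctl -D $HOME/pg-zheap/data -l zheap.log start
# See README.md in the repository root for how to create zheap tables and for
# the open issues.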

Preliminary performance results
-------------------------------------------

We've measured the performance of zheap against heap in a few different
pgbench scenarios. All of these tests were run with data that fits in
shared_buffers (32GB), and with 16 transaction slots per zheap page.
Scenarios 1 and 2 used synchronous_commit = off; Scenarios 3 and 4 used
synchronous_commit = on.

Scenario 1: A 15-minute simple-update pgbench test with scale factor 100
shows a 5.13% TPS improvement with 64 clients. The improvement increases
with the scale factor; at scale factor 1000, it reaches 11.5% with 64
clients.

              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            4.08 GB   3 GB              -26.47%
              1000           15 GB     12.6 GB           -16%

* The size of the zheap tables increases because of the insertions into the
pgbench_history table.

Scenario 2: To show the effect of bloat, we performed another test similar
to the previous scenario, but with a transaction kept open for the first 15
minutes of a 30-minute test. This restricts HOT-pruning for heap and undo
discarding for zheap during the first half of the test.
Scale factor 1000: 75.86% TPS improvement for zheap at 64 clients.
Scale factor 3000: 98.18% TPS improvement for zheap at 64 clients.

              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
After test    1000           19 GB     14 GB             -26.3%
              3000           45 GB     37 GB             -17.7%

* The size of the zheap tables increases because of the insertions into the
pgbench_history table.

The reason for this huge performance improvement is that when the
long-running transaction commits after 900 seconds, autovacuum workers
start working and degrade the performance of heap for a long time. In
addition, the heap tables are bloated by a significant amount. On the other
hand, the undo worker discards the undo very quickly, and there is no bloat
in the zheap relations. In brief, zheap confines the bloat to the undo
segments; we just need to determine how much undo can be discarded and
remove it, which is cheap.

Scenario 3: A 15-minute simple-update pgbench test with scale factor 100
shows a 6% TPS improvement with 64 clients. The improvement increases as we
raise the scale factor to 1000, reaching 11.8% with 64 clients.

              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            2.88 GB   2.20 GB           -23.61%
              1000           13.9 GB   11.7 GB           -15.8%

* The size of the zheap tables increases because of the insertions into the
pgbench_history table.

Scenario 4: To amplify the effect of bloat seen in Scenario 3, we performed
another test similar to that scenario, but with a transaction kept open for
the first 15 minutes of a 30-minute test. This restricts HOT-pruning for
heap and undo discarding for zheap during the first half of the test.

              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
After test    1000           15.5 GB   12.4 GB           -20%
              3000           40.2 GB   35 GB             -12.9%

* The size of the zheap tables increases because of the insertions into the
pgbench_history table.
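
For reference, here is a rough shell sketch of how runs along the lines of
Scenarios 1 and 2 can be set up with stock pgbench. The database name, the
exact commands, and the way the long-open transaction is held are
illustrative, not the exact scripts and configuration we used (for
instance, setting shared_buffers to 32GB is not shown):

# Scenario 1 style: 15-minute simple-update run, 64 clients, scale factor 100,
# with synchronous_commit = off.
createdb bench
pgbench -i -s 100 bench
psql bench -c "ALTER SYSTEM SET synchronous_commit = off;"
psql bench -c "SELECT pg_reload_conf();"
pgbench -N -c 64 -j 64 -T 900 bench

# Scenario 2 style: the same workload for 30 minutes, with a transaction held
# open for the first 15 minutes so that HOT-pruning (heap) and undo discarding
# (zheap) are held back during the first half of the run.
psql bench -c "BEGIN; SELECT txid_current(); SELECT pg_sleep(900); COMMIT;" &
pgbench -N -c 64 -j 64 -T 1800 bench
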
Pros
--------
1. Zheap has better performance characteristics: it is smaller in size, and
its mechanism for discarding undo in the background is cheaper than
HOT-pruning.
2. The performance improvement is huge in cases where the heap bloats while
zheap confines the bloat to the undo.
3. We will also see a good performance boost in cases where an UPDATE
statement modifies few of the indexed columns.
4. System slowdowns due to vacuum (or autovacuum) would be reduced to a
great extent.
5. Due to fewer rewrites of the heap (e.g. no freezing, HOT-pruning, or
hint-bit setting), the overall write and WAL volume will be lower.

Cons
-----------
1. Deletes can be somewhat expensive.
2. Transaction aborts will be expensive.
3. Updates that update most of the indexed columns can be somewhat
expensive.

Credits
------------
Robert did much of the basic design work. The design and development of
various subsystems of zheap have been done by a team consisting of me,
Dilip Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, Beena
Emerson, and Amit Khandekar. Thomas Munro wrote the undo storage system.
Marc Linster has provided unfailing management support, and Andres Freund
has provided some design input (and criticism). Neha Sharma and Tushar
Ahuja are helping with the testing of this project.

[1] - https://github.com/EnterpriseDB/zheap
[2] - http://rhaas.blogspot.in/2018/01/do-or-undo-there-is-no-vacuum.html
[3] - https://wiki.postgresql.org/wiki/Zheap#Integration_with_Pluggable_Storage_API

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
