Re: zheap: a new storage format for PostgreSQL

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: zheap: a new storage format for PostgreSQL
Date: 2018-11-01 06:43:51
Message-ID: CAA4eK1Lwb+rGeB_z+jUbnSndvgnsDUK+9tjfng4sy1AZyrHqRg@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 26, 2018 at 6:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Mar 2, 2018 at 4:05 PM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> It's been a while since we have updated the progress on this project,
> so here is an update.
>

Yet another update.

> This is based on the features that were not
> working (as mentioned in Readme.md) when the branch was published.
> 1. TID Scans are working now.
> 2. Insert .. On Conflict is working now.
> 3. Tuple locking is working with a restriction that if there are more
> concurrent lockers on a page than the number of transaction slots on a
> page, then some of the lockers will wait till others get committed.
> We are working on a solution to extend the number of transaction slots
> on a separate set of pages which exist in heap, but will contain only
> transaction data.
>

Now we have a working solution for this problem. The extended
transaction slots are stored in TPD pages (pages that contain only
transaction slot arrays), which are interleaved with regular pages.
For a detailed description, see the comments atop
src/backend/access/zheap/tpd.c. One caveat remains: once TPD pages
are pruned (a TPD page can be pruned when all of its transaction
slots are old enough that they no longer matter), they are not added
to the FSM for reuse. We are working on a patch for this, which we
expect to finish in a week or so.
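To make the slot mechanism concrete, here is a deliberately simplified
sketch of the lookup described above. The structure and field names
(TransSlot, ZPage, get_free_trans_slot, and the slot counts) are
illustrative assumptions, not the actual zheap data structures; the
real code lives in the zheap branch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a zheap page keeps a small fixed array of transaction
 * slots; when those run out, additional slots live in a TPD page that
 * contains only slot arrays.  All names here are hypothetical. */
#define NUM_PAGE_SLOTS 4

typedef struct TransSlot
{
	uint32_t	xid;		/* transaction occupying the slot; 0 = free */
	uint64_t	urec_ptr;	/* latest undo record of that transaction */
} TransSlot;

typedef struct ZPage
{
	TransSlot	slots[NUM_PAGE_SLOTS];	/* in-page transaction slots */
	TransSlot  *tpd_slots;				/* extended slots on a TPD page */
	int			num_tpd_slots;
} ZPage;

/*
 * Find a free transaction slot, preferring the in-page slots and
 * falling back to the TPD extension.  Returns a slot index, or -1
 * when every slot is taken, in which case the caller must wait for a
 * slot-owning transaction to finish (or extend the TPD slot array).
 */
static int
get_free_trans_slot(ZPage *page)
{
	for (int i = 0; i < NUM_PAGE_SLOTS; i++)
		if (page->slots[i].xid == 0)
			return i;
	for (int i = 0; i < page->num_tpd_slots; i++)
		if (page->tpd_slots[i].xid == 0)
			return NUM_PAGE_SLOTS + i;
	return -1;
}
```

The point of the TPD fallback is visible in the return values: indexes
beyond NUM_PAGE_SLOTS address the extension array, so lockers beyond
the in-page capacity no longer have to wait.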

Toast tables are working now; the toast data is stored in zheap.
Apart from the consistency of storing toast data in the same storage
engine as the main data, this has the advantage of early cleanup,
which means the space for deleted rows can be reclaimed as soon as
the transaction commits. This is good for toast tables, as each
update of a toast table is a DELETE + INSERT.

Tuple alignment has been changed so that there is no align padding
between the tuple header and the tuple data, as we always make a copy
of the tuple to support in-place updates. Likewise, we ideally don't
need any alignment padding between tuples. However, there are places
in the zheap code where we access the tuple header directly from the
page (e.g. zheap_delete, zheap_update), for which we want tuples to
be aligned at a two-byte boundary. We omit all alignment padding for
pass-by-value types. Even in the current heap, we never point directly
to such values, so the alignment padding doesn’t help much; it lets us
fetch the value using a single instruction, but that is all.
Pass-by-reference types will work as they do in the heap. We can't
directly access unaligned values; instead, we need to use memcpy. We
believe that the space savings will more than pay for the additional
CPU costs.
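The memcpy fetch mentioned above can be sketched as follows. This is
not zheap code, just a minimal illustration of why unaligned
pass-by-reference data must be copied rather than dereferenced in
place.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * With alignment padding omitted, a 4-byte value may start at an odd
 * offset inside the tuple.  Casting that address to (int32_t *) and
 * dereferencing it is undefined behaviour on strict-alignment
 * platforms, so the value is fetched with memcpy instead; on CPUs
 * that permit unaligned loads, compilers reduce this to a single
 * load instruction anyway.
 */
static int32_t
fetch_unaligned_int32(const char *datum)
{
	int32_t		val;

	memcpy(&val, datum, sizeof(val));
	return val;
}
```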

Vacuum full is implemented in such a way that we don't copy the
information required for MVCC-aware scans. We copy only LIVE tuples
into the new heap and freeze them before storing them there. This is
not ideal, as we lose all the visibility information of the tuples;
but OTOH, that information can't be copied from the original tuple,
as it is maintained in undo and we don't have a facility to modify
undo records. We could either allow undo records to be modified or
write a special kind of undo record that captures the required
visibility information. I think doing this will be tricky, and I am
not sure it is worth putting in a whole lot of effort before the
basic things work; moreover, with zheap, the need for vacuum will
anyway be reduced to a good extent.

Serializable isolation is also supported; we didn't need to make any
major changes except for making it understand ZHeapTuple (we used the
TID in the required APIs). I think this part will need some changes
after integration with the pluggable storage API. We have special
handling for tuples that have been updated in place, or whose latest
modifying transaction was aborted. In that case, we check whether the
latest committed transaction that modified the tuple is a concurrent
transaction, and based on that we decide whether there is a
serialization conflict.
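The decision described above boils down to a simple predicate. The
sketch below is a hypothetical model, not the actual predicate-locking
code: it uses a logical snapshot timestamp where the real
implementation consults transaction state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a snapshot, reduced to the one fact the
 * check needs: when it was taken. */
typedef struct SnapshotModel
{
	uint64_t	taken_at;	/* logical timestamp of snapshot */
} SnapshotModel;

/*
 * A committed writer is "concurrent" with a serializable reader only
 * if it was still in progress when the reader's snapshot was taken,
 * i.e. it committed after that point; only then can it contribute to
 * a serialization conflict.
 */
static bool
committed_xact_is_concurrent(uint64_t commit_ts, const SnapshotModel *snap)
{
	return commit_ts > snap->taken_at;
}
```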

In zheap, we don't need to generate a new XID for sub-transactions,
as the visibility information for a particular tuple is present in
undo, and on Rollback To Savepoint we apply the required undo to
restore the tuples to the state they were in before that
sub-transaction. This gives us a performance/scalability boost when
sub-transactions are involved, as we don't need to acquire XIDGenLock
for a sub-transaction. Apart from those benefits, zheap needs this
anyway, as otherwise the undo chain for each transaction wouldn't be
linear; it also saves allocating additional page-level slots for each
transaction id.
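The Rollback To Savepoint mechanism above can be sketched with a toy
undo log. This is an illustration of the idea only; the names
(UndoRec, record_update, rollback_to_savepoint) and the
integer-valued "tuples" are invented for the example, and the real
undo records are of course far richer.

```c
#include <assert.h>

/* Toy undo log: each record remembers which tuple was modified and
 * its prior value, forming a linear per-transaction chain. */
#define MAX_UNDO 16

typedef struct UndoRec
{
	int			slot;		/* which tuple was modified */
	int			old_value;	/* value before the modification */
} UndoRec;

typedef struct UndoLog
{
	UndoRec		recs[MAX_UNDO];
	int			len;
} UndoLog;

/* Modify a tuple, first recording its old value in the undo log. */
static void
record_update(UndoLog *log, int *tuples, int slot, int new_value)
{
	log->recs[log->len].slot = slot;
	log->recs[log->len].old_value = tuples[slot];
	log->len++;
	tuples[slot] = new_value;
}

/*
 * Rollback To Savepoint: walk the undo chain backwards, re-applying
 * saved values until the log shrinks to the savepoint's position.
 * No per-subtransaction XID is involved anywhere.
 */
static void
rollback_to_savepoint(UndoLog *log, int *tuples, int savepoint)
{
	while (log->len > savepoint)
	{
		log->len--;
		tuples[log->recs[log->len].slot] = log->recs[log->len].old_value;
	}
}
```

A savepoint here is just the undo log's length at the time the
savepoint was taken, which is exactly why the chain must stay linear.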

Undo workers and transaction rollbacks are working now. My colleague
Dilip has posted a separate patch [1] for this, as it can have some
use cases without zheap as well, and Thomas has just posted a patch
[2] using that facility.

Some of the other features, like row movement on update of the
partition key, are also handled.

In short, most of the user-visible features are now working. The make
installcheck for zheap has 12 failures, mostly due to plan or stats
changes, since zheap has additional meta pages (a meta page and TPD
pages) and/or in-place updates. So in most cases either an additional
ORDER BY needs to be added or some minor tweak to the query is
required. The isolation test has one failure, which again is due to
in-place updates and seems to be a valid case, but it needs a bit
more investigation. We have yet to support JIT for zheap, so the
corresponding tests also fail.

Some of the main things that are not working:
Logical decoding - I am not sure at this stage whether it is a must
for the first version of zheap. Surely, we can have a basic design
ready.
Snapshot too old - This feature allows the data in heap pages to be
removed in the presence of old transactions. This is going to work
differently for zheap, as we want the undo for older snapshots to go
away, rather than basing removal on heap pages as we do for the
current heap. One can argue that we should make it similar to the
current heap, but I see a lot less value in that, as this new heap
works entirely differently and we can have a better implementation
for it.
Delete marking in indexes - This will allow in-place updates even
when index columns are updated, and additionally it lets us avoid the
need for a dedicated vacuum process to perform retail deletes. This
is a feature we definitely want to do separately from the main heap
work, because the current indexes work with zheap without any major
changes.

You can find the latest code at https://github.com/EnterpriseDB/zheap

I would like to highlight again that this is not my work alone.
Dilip Kumar, Kuntal Ghosh, Rafia Sabih, Mithun C Y, and Amit
Khandekar have worked along with me to make this progress.

Feedback is welcome.

[1] - https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA(at)mail(dot)gmail(dot)com
[2] - https://www.postgresql.org/message-id/CAEepm%3D0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG%2BXpssKqmezug%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
