Re: zheap: a new storage format for PostgreSQL

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: zheap: a new storage format for PostgreSQL
Date: 2018-03-02 10:31:57
Message-ID: CAA4eK1KExs1MR6He_AZi7HQNCd=SK_wWUEKkFGLRrv0nCj79_Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 2, 2018 at 1:50 PM, Tsunakawa, Takayuki
<tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com> wrote:
> From: Amit Kapila [mailto:amit(dot)kapila16(at)gmail(dot)com]
>> At EnterpriseDB, we (me and some of my colleagues) have been working for
>> more than a year on a new storage format in which only the latest version
>> of the data is kept in main storage and the old versions are moved to an
>> undo log. We call this new storage format "zheap". To be clear, this
>> proposal is for PG-12.
>
> Wonderful! BTW, what does "z" stand for? Ultimate?
>

There is no special meaning to 'z'. We discussed quite a few names
(like newheap, nheap, zheap and some more along those lines), but
zheap sounded better. IIRC, either Robert or Thomas came up with the
name.

>
>
> Below are my first questions and comments.
>
> (1)
> This is a simple question from the user's perspective. For what kinds of workloads would you recommend zheap and heap, respectively?
>

I think you have already mentioned some of the important use cases for
zheap, namely update-intensive workloads, and probably cases where
users run long-running queries concurrently with updates.

> Are you going to recommend zheap for all use cases, and will heap be deprecated?
>

Oh, no. I don't think so. We have not yet measured zheap's
performance in very many scenarios, so it is difficult to speak about
all the cases, but I think eventually Deletes, Updates that change
most of the index columns, and Rollbacks will be somewhat costlier in
zheap. At this stage we can't measure everything because (a) a few
things are not yet implemented and (b) we have not done much
performance optimization of the code.

> I felt zheap would be better for update-intensive workloads. Then, how about insert-and-read-mostly databases like a data warehouse? zheap seems better for that, since the database size is reduced. Although data loading may generate more transaction logs for undo, that increase is offset by the reduction of the tuple header in WAL.
>

We have done an optimization whereby we don't need to WAL-log the
complete undo data, as it can be regenerated from the page during
recovery if full_page_writes is enabled.
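
The idea above can be illustrated with a toy sketch (not PostgreSQL code; the record layout and function names here are invented for illustration): when a full-page image is already going into WAL, the old tuple that the undo record needs is contained in that image, so the WAL record only has to say where the old tuple lives, and recovery regenerates the undo payload from the page image.

```python
# Toy model: with full_page_writes on, the first modification of a page
# after a checkpoint logs the whole page image, so the undo payload (the
# old tuple) need not be duplicated in WAL.

def wal_record_for_update(page_image, offset, old_tuple, new_tuple,
                          full_page_writes=True):
    """Return a dict standing in for a WAL record."""
    if full_page_writes:
        # Only log where the old tuple lives; recovery re-reads it
        # from the logged page image.
        return {"fpi": dict(page_image), "offset": offset, "new": new_tuple}
    # Without a full-page image, the undo payload must be logged explicitly.
    return {"offset": offset, "old": old_tuple, "new": new_tuple}

def regenerate_undo(record):
    """Recovery-side: rebuild the undo payload for this change."""
    if "fpi" in record:
        return record["fpi"][record["offset"]]
    return record["old"]

page = {0: "old-tuple-bytes"}
rec = wal_record_for_update(page, 0, "old-tuple-bytes", "new-tuple-bytes")
assert regenerate_undo(rec) == "old-tuple-bytes"
```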

> zheap allows us to run long-running analytics and reporting queries simultaneously with updates without the concern on database bloat, so zheap is a way toward HTAP, right?
>

I think so.

>
> (2)
> Can zheap be used for system catalogs?
>

As of now, we are not planning to support it for system catalogs, as
that involves much more work, but I think we could do it if we wanted
to.

>
> (3)
>> Scenario 1: A 15 minutes simple-update pgbench test with scale factor 100
>> shows 5.13% TPS improvement with 64 clients. The performance improvement
>> increases as we increase the scale factor; at scale factor 1000, it
>> reaches 11.5% with 64 clients.
>
> What was the fillfactor?
>

Default.

> What would be the comparison when HOT works effectively for heap?
>

I guess this is the case where HOT works effectively.

>
> (4)
> "Undo logs are not yet crash-safe. Fsync and some recovery details are yet to be implemented."
>
> "We also want to make FSM crash-safe, since we can’t count on
> VACUUM to recover free space that we neglect to record."
>
> Would these directly affect the response time of each transaction?
>

Not the first one. The second one might, depending on the actual
implementation, but I think it is difficult to predict much at this
stage.

>
> (5)
> "The tuple header is reduced from 24 bytes to 5 bytes (8 bytes with alignment):
> 2 bytes each for infomask and infomask2, and one byte for t_hoff. I think we
> might be able to squeeze some space from t_infomask, but for now, I have kept
> it as two bytes. All transactional information is stored in undo, so fields
> that store such information are not needed here."
>
> "To check the visibility of a
> tuple, we fetch the transaction slot number stored in the tuple header, and
> then get the transaction id and undo record pointer from transaction slot."
>
> Where in the tuple header is the transaction slot number stored?
>

It is stored in t_infomask2; refer to zhtup.h.
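
A toy sketch of the layout described above (the field widths come from the quoted text; the exact bit positions of the slot number within t_infomask2 are an assumption here, so check zhtup.h for the real definitions):

```python
import struct

# Hypothetical zheap tuple header: 2 bytes t_infomask, 2 bytes t_infomask2,
# 1 byte t_hoff -- 5 bytes, padded to 8 with alignment.  The bit split of
# t_infomask2 below (low 3 bits = transaction slot) is illustrative only.

ZHEAP_XACT_SLOT_MASK = 0x7  # assumed: low 3 bits hold the slot number

def pack_zheap_header(infomask, trans_slot, natts, t_hoff):
    infomask2 = (natts << 3) | (trans_slot & ZHEAP_XACT_SLOT_MASK)
    return struct.pack("<HHB", infomask, infomask2, t_hoff)

def unpack_trans_slot(header):
    _, infomask2, _ = struct.unpack("<HHB", header)
    return infomask2 & ZHEAP_XACT_SLOT_MASK

hdr = pack_zheap_header(infomask=0, trans_slot=3, natts=5, t_hoff=8)
assert len(hdr) == 5            # 5-byte header before alignment padding
assert unpack_trans_slot(hdr) == 3
```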

>
> (6)
> "As of now, we have four transaction slots per
> page, but this can be changed. Currently, this is a compile-time option; we
> can decide later whether such an option is desirable in general for users."
>
> "The one known problem with the fixed number of slots is that
> it can lead to deadlock, so we are planning to add a mechanism to allow the
> array of transactions slots to be continued on a separate overflow page. We
> also need such a mechanism to support cases where a large number of
> transactions acquire SHARE or KEY SHARE locks on a single page."
>
> I wish for this. I was bothered with deadlocks with Oracle and had to tune INITRANS with CREATE TABLE. The fixed number of slots introduces a new configuration parameter, which adds something the DBA has to be worried about and monitor a statistics figure for tuning.
>

Yeah.
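
A toy model of the mechanism discussed above (not PostgreSQL code; the class and field names are invented): each page has a fixed array of transaction slots, and when all of them are busy a further writer would have to wait (the deadlock risk mentioned in the quote) unless the slot array can be continued on a separate overflow page, as planned.

```python
# Toy model of per-page transaction slots with the planned overflow page.

TRANS_SLOTS_PER_PAGE = 4  # a compile-time option in the proposal

class Page:
    def __init__(self):
        self.slots = {}        # slot number -> xid, on the page itself
        self.overflow = {}     # slots continued on a separate overflow page

    def acquire_slot(self, xid):
        for n in range(TRANS_SLOTS_PER_PAGE):
            if n not in self.slots:
                self.slots[n] = xid
                return n
        # All fixed slots are busy: instead of waiting (deadlock risk),
        # spill to the overflow page.
        n = TRANS_SLOTS_PER_PAGE + len(self.overflow)
        self.overflow[n] = xid
        return n

page = Page()
slots = [page.acquire_slot(xid) for xid in (101, 102, 103, 104, 105)]
assert slots == [0, 1, 2, 3, 4]
assert 4 in page.overflow      # the fifth writer landed on the overflow page
```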

>
> (7)
> What index AMs does "indexes which lack delete-marking support" apply to?
>

Currently, delete-marking is not supported for any of the index AMs,
but we are planning to add it for B-tree.

> Can we be freed from vacuum in a typical use case where only zheap and B-tree indexes are used?
>

It depends on what you mean by typical workloads. For some workloads,
such as inserting monotonically increasing values and then deleting
the initial values from the index (say someone inserts
11111111111111...2222222222....333333... and then deletes all the
1's), we might not immediately reclaim space in the index. I don't
think we need vacuum per se for such cases, but we will eventually
need some way to clear the bloat. However, I think we are still far
from there.

>
> (8)
> How does rollback after subtransaction rollback work? Does the undo of a whole transaction skip the undo of the subtransaction?
>

We rewind the undo pointer after rolling back a subtransaction, so we
only need to roll back the remaining part.

>
> (9)
> Will the prepare of 2pc transactions be slower, as they have to safely save undo log?
>

I don't think so; for prepared transactions, we just need to save the
'from' and 'to' undo record pointers. OTOH, we have not yet measured
the performance of this case.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
