Re: zheap: a new storage format for PostgreSQL

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: zheap: a new storage format for PostgreSQL
Date: 2018-03-03 04:05:28
Message-ID: 42ebe808-b40f-dd90-c815-e074398992ee@catalyst.net.nz
Lists: pgsql-hackers

On 03/03/18 05:03, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> I would propose "zero-bloat heap" disambiguation of zheap. Seems like fair
>> enough explanation for me without need to rename :)
> It will be possible to bloat a zheap table in certain usage patterns.
> For example, if you bulk-load the table with a ton of data, commit the
> transaction, delete every other row, and then never insert any more
> rows ever again, the table is bloated: it's twice as large as it
> really needs to be, and we have no provision for shrinking it. In
> general, I think it's very hard to keep bulk deletes from leaving
> bloat in the table, and to the extent that it *is* possible, we're not
> doing it. One could imagine, for example, an index-organized table
> that automatically combines adjacent pages when they're empty enough,
> and that also relocates data to physically lower-numbered pages
> whenever possible. Such a storage engine might automatically shrink
> the on-disk footprint after a large delete, but we have no plans to go
> in that direction.
>
> Rather, our assumption is that the bloat most people care about comes
> from updates. By performing updates in-place as often as possible, we
> hope to avoid bloating both the heap (because we're not adding new row
> versions to it which then have to be removed) and the indexes (because
> if we don't add new row versions at some other TID, then we don't need
> to add index pointers to that new TID either, or remove the old index
> pointers to the old TID). Without delete-marking, we can basically
> optimize the case that is currently handled via HOT updates: no
> indexed columns have changed. However, the in-place update has a
> major advantage that it still works even when the page is completely
> full, provided that the row does not expand. As Amit's results show,
> that can hugely reduce bloat and increase performance in the face of
> long-running concurrent transactions. With delete-marking, we can
> also optimize the case where indexed columns have been changed. We
> don't know exactly how well this will work yet because the code isn't
> written and therefore can't be benchmarked, but am hopeful that
> in-place updates will be a big win here too.
>
> So, I would not describe a zheap table as zero-bloat, but it should
> involve a lot less bloat than our standard heap.
>

For folk doing ETL-type data warehousing this should be great, as the
typical workload tends to be: COPY (or similar) from a foreign data
source, then several sets of UPDATEs to fix/check/scrub the data...
which tends to result in huge bloat with the current heap design
(despite telling people 'you can do it another way' to avoid bloat -
I guess it seems more intuitive to just do it as described).
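
As a rough back-of-envelope illustration of that workload (a toy model,
not a benchmark): assume every UPDATE pass touches every row, that the
current heap keeps each old row version in the table because nothing is
vacuumed between passes, and that an in-place update leaves the number
of row versions in the table unchanged. The function name heap_pages and
the figures (1,000,000 rows, 100 rows per page, 3 passes) are made up
for the example; HOT pruning, fillfactor and undo storage are ignored.

```python
import math


def heap_pages(rows, update_passes, rows_per_page, in_place):
    """Toy estimate of table pages after a bulk load plus full-table UPDATE passes.

    Assumption: without in-place updates, every pass adds one new version of
    every row and no old version is removed in between.
    """
    if in_place:
        # Old versions are assumed to live outside the table (e.g. in undo).
        versions = rows
    else:
        # The loaded rows plus one extra version per row per pass.
        versions = rows * (1 + update_passes)
    return math.ceil(versions / rows_per_page)


after_load = heap_pages(1_000_000, 0, 100, in_place=False)
after_scrub_heap = heap_pages(1_000_000, 3, 100, in_place=False)
after_scrub_in_place = heap_pages(1_000_000, 3, 100, in_place=True)
print(after_load, after_scrub_heap, after_scrub_in_place)
```

Under those assumptions the table grows to four times its loaded size
after three scrub passes with the current heap, and stays at its loaded
size when the updates are done in place.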

regards
Mark
