From: | Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: zheap: a new storage format for PostgreSQL |
Date: | 2018-03-03 04:05:28 |
Message-ID: | 42ebe808-b40f-dd90-c815-e074398992ee@catalyst.net.nz |
Lists: | pgsql-hackers |
On 03/03/18 05:03, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> I would propose "zero-bloat heap" as the expansion of zheap. Seems like a
>> fair enough explanation to me, without the need to rename :)
> It will be possible to bloat a zheap table in certain usage patterns.
> For example, if you bulk-load the table with a ton of data, commit the
> transaction, delete every other row, and then never insert any more
> rows ever again, the table is bloated: it's twice as large as it
> really needs to be, and we have no provision for shrinking it. In
> general, I think it's very hard to keep bulk deletes from leaving
> bloat in the table, and to the extent that it *is* possible, we're not
> doing it. One could imagine, for example, an index-organized table
> that automatically combines adjacent pages when they're empty enough,
> and that also relocates data to physically lower-numbered pages
> whenever possible. Such a storage engine might automatically shrink
> the on-disk footprint after a large delete, but we have no plans to go
> in that direction.
>
> Rather, our assumption is that the bloat most people care about comes
> from updates. By performing updates in-place as often as possible, we
> hope to avoid bloating both the heap (because we're not adding new row
> versions to it which then have to be removed) and the indexes (because
> if we don't add new row versions at some other TID, then we don't need
> to add index pointers to that new TID either, or remove the old index
> pointers to the old TID). Without delete-marking, we can basically
> optimize the case that is currently handled via HOT updates: no
> indexed columns have changed. However, the in-place update has a
> major advantage that it still works even when the page is completely
> full, provided that the row does not expand. As Amit's results show,
> that can hugely reduce bloat and increase performance in the face of
> long-running concurrent transactions. With delete-marking, we can
> also optimize the case where indexed columns have been changed. We
> don't know exactly how well this will work yet because the code isn't
> written and therefore can't be benchmarked, but I am hopeful that
> in-place updates will be a big win here too.
>
> So, I would not describe a zheap table as zero-bloat, but it should
> involve a lot less bloat than our standard heap.
>
For folk doing ETL-type data warehousing this should be great, as the
typical workload tends to be: COPY (or similar) from a foreign data
source, then several sets of UPDATEs to fix/check/scrub the
data... which tends to result in huge bloat with the current heap design
(despite telling people 'you can do it another way' to avoid bloat -
I guess it seems more intuitive to just do it as described).
regards
Mark
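
The arithmetic behind the two scenarios in this message (the bulk delete
in the quoted text, and the COPY-then-UPDATE workload in the reply) can be
sketched with a toy page-count model. This is only a back-of-the-envelope
illustration, not how PostgreSQL or zheap account for space: the function
names, the fixed rows-per-page figure, and the assumption that no dead row
versions are reclaimed between UPDATE passes are all hypothetical, and
storage used for old row versions outside the table is ignored.

```python
import math


def pages_needed(row_versions: int, rows_per_page: int) -> int:
    """Pages required to hold the given number of row versions."""
    return math.ceil(row_versions / rows_per_page)


def heap_pages_after_updates(rows: int, rows_per_page: int, passes: int) -> int:
    """Toy model of the standard heap.

    Assumption: each UPDATE pass writes one new version of every row and
    no dead versions are reclaimed between passes.
    """
    return pages_needed(rows * (1 + passes), rows_per_page)


def inplace_pages_after_updates(rows: int, rows_per_page: int, passes: int) -> int:
    """Toy model of in-place updates.

    Assumption: no row grows, so every row is overwritten where it sits
    and the page count does not depend on the number of passes.
    """
    return pages_needed(rows, rows_per_page)


def bloat_factor(pages_used: int, live_rows: int, rows_per_page: int) -> float:
    """Ratio of pages occupied to pages the live rows would need if packed."""
    return pages_used / pages_needed(live_rows, rows_per_page)


# Hypothetical numbers chosen only for illustration.
rows, rows_per_page, passes = 1_000_000, 100, 3

# ETL case: bulk load followed by three full-table UPDATE passes.
heap = heap_pages_after_updates(rows, rows_per_page, passes)
inplace = inplace_pages_after_updates(rows, rows_per_page, passes)
print(heap, inplace, heap / inplace)  # 40000 10000 4.0

# Bulk-delete case: load, delete every other row, never insert again.
loaded_pages = pages_needed(rows, rows_per_page)
print(bloat_factor(loaded_pages, rows // 2, rows_per_page))  # 2.0
```

Under these assumptions the in-place model holds the table at its loaded
size through the UPDATE passes, while the bulk-delete case leaves either
design at twice the space the surviving rows need, which matches the
point that zheap is not zero-bloat.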