From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Enabling deduplication with system catalog indexes
Date: 2021-09-29 18:27:28
Message-ID: CAH2-Wz=rYQHFaJ3WYBdK=xgwxKzaiGMSSrh-ZCREa-pS-7Zjew@mail.gmail.com

System catalog indexes do not support deduplication as a matter of
policy. I chose to do things that way during the Postgres 13
development cycle due to the restriction on using storage parameters
with system catalog indexes. At the time I felt that *forcing* the use
of deduplication with system catalog indexes might expose users to
problems. But this is something that seems worth revisiting now. (I
haven't actually investigated what it would take to make system
catalogs support the 'deduplicate_items' parameter, but that may not
matter now.)
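
To illustrate the status quo (a sketch from memory -- the exact error
text may vary by version): the parameter is settable on ordinary
indexes, but system catalog indexes reject storage parameter changes
outright unless allow_system_table_mods is set.

    -- Works on a user index ('deduplicate_items' defaults to 'on'
    -- in Postgres 13+):
    CREATE TABLE example (val int);
    CREATE INDEX example_val_idx ON example (val)
        WITH (deduplicate_items = on);

    -- Rejected for a system catalog index:
    ALTER INDEX pg_catalog.pg_depend_reference_index
        SET (deduplicate_items = on);
    -- ERROR:  permission denied: "pg_depend_reference_index" is a
    -- system catalog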

I would like to enable deduplication within system catalog indexes for
Postgres 15. Leaving it disabled forever seems kind of arbitrary at
best. In general enabling deduplication (or not disabling it) has only
a fixed, small downside in the worst case. It has a huge upside in
favorable cases. Deduplication is part of our high level strategy for
avoiding nbtree index bloat from version churn (non-HOT updates with
several indexes that are never "logically modified"). It effectively
cooperates with and builds on the new index deletion enhancements in
Postgres 14. Plus those same enhancements more or
less eliminated a theoretical downside of deduplication: now it
doesn't really matter that posting list tuples only have a single
LP_DEAD bit (if it ever did). This is because we can now do granular
posting list TID deletion, provided the deletion process visits the
same heap block in passing.
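
To make that concrete, here is a rough way to see posting list tuples
with the pageinspect extension (a sketch; it assumes a small index
whose first leaf page is block 1, and the table/index names are made
up):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    CREATE TABLE dedup_demo (val int);
    CREATE INDEX dedup_demo_idx ON dedup_demo (val);
    -- Deduplication runs lazily, when a leaf page would otherwise
    -- have to split:
    INSERT INTO dedup_demo SELECT 1 FROM generate_series(1, 2000);

    -- Posting list tuples show a non-NULL "tids" array here, with
    -- many heap TIDs packed into a single index tuple:
    SELECT itemoffset, htid, cardinality(tids) AS heap_tids
    FROM bt_page_items('dedup_demo_idx', 1)
    WHERE tids IS NOT NULL;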

I can find no evidence that even a single user found it useful to
disable deduplication while using Postgres 13 in production (judging
by a Google search for "deduplicate_items"). While I myself said back
in early 2020 that there might be a throughput regression of up to 2%,
that was under highly unrealistic conditions that could never apply to
system catalogs -- I was being conservative. Most system
catalog indexes are unique indexes, where there is no possible
overhead from deduplication unless we already know for sure that the
index is subject to some kind of version churn (and so have high
confidence that deduplication will be at least somewhat effective at
buying time for VACUUM). The non-unique system catalog indexes seem
pretty likely to benefit from deduplication in the usual obvious way
(not so much because of versioning and bloat). The two pg_depend
non-unique indexes tend to have a fair number of duplicates.
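
For instance, a quick (and unscientific) way to eyeball that
duplication on the reference side -- these grouping keys are the
columns that pg_depend_reference_index covers:

    SELECT refclassid::regclass, refobjid, refobjsubid,
           count(*) AS ndups
    FROM pg_depend
    GROUP BY refclassid, refobjid, refobjsubid
    ORDER BY ndups DESC
    LIMIT 5;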

--
Peter Geoghegan
