Preventing free space from being reused

From: Noah Bergbauer <noah(at)statshelix(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Preventing free space from being reused
Date: 2021-02-12 14:41:37
Message-ID: CABjy+RhbFu_Hs8ZEiOzaPaJSGB9jqFF0gDU5gtwCLiurG3NLjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

I am working on a system to store and analyze JSON-based log data. The idea
is to store tuples with a monotonically increasing timestamp (i.e.
CURRENT_TIMESTAMP) in a table and run some queries that focus on specific
time windows (~one hour). It seems to me that a BRIN index is designed
exactly for this type of application.

However, I find that a BRIN index with standard settings (pages_per_range =
128) with very simple filters (e.g. ts > (CURRENT_TIMESTAM - INTERVAL '1
hour')) causes ~20x more pages to be scanned than I would expect. A
majority of these tuples is then discarded due to index rechecking.
Inspecting the BRIN pages using the pageinspect extension reveals the
following problem: it seems that if a page is left with some free space
that can not be filled right away (because the tuples being inserted are
too large), then this hole can be filled at some arbitrary later point in
time (sometimes hours later) once a small enough tuple comes along. This
substantially reduces the BRIN index's effectiveness.

I confirmed this theory by CLUSTERing the table using a temporary btree
index. Suddenly the query performance exactly matched my estimates.

The JSON data structure is often similar, so the table is stored on ZFS
with compression. Hence, filling these holes brings no benefit - they were
taking up no space thanks to the compression. On the other hand, rewriting
old pages in order to fill up holes also creates a lot of fragmentation
because of ZFS's copy-on-write semantics.

In summary, all of these problems would be solved if there was some way to
stop the database system from ever reusing free space.

Bonus question: what's the best TOAST configuration for this use case? Is
there any point in out-of-line storage when all tuples are going to be
quite large (i.e. > 1kB)? Is there any benefit in having postgresql
compress the data when ZFS runs compression regardless?

Thank you,
Noah Bergbauer

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2021-02-12 15:14:17 Re: kernel.shmmax and kernel.shmall for Linux server
Previous Message Guy Burgess 2021-02-12 12:54:56 Re: PostgreSQL occasionally unable to rename WAL files (NTFS)