Re: The Free Space Map: Problems and Opportunities

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Jan Wieck <jan(at)wi3ck(dot)info>, Gregory Smith <gregsmithpgsql(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: The Free Space Map: Problems and Opportunities
Date: 2021-09-06 21:58:53
Message-ID: CAH2-Wzm-VhVeQYTH8hLyYho2wdG8Ecrm0uPQJWjap6BOVfe9Og@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> This is actually quite simple -- the chaos that follows is where it
> gets truly complicated (and not necessarily worth getting into just
> yet). The update of a given order and its order lines entries takes
> place hours later, within a delivery transaction.

Update on this: aborted transactions play an important role in the
chaos that we see with BenchmarkSQL/TPC-C.

This is surprising, in one way. Only 1% of all transactions are
aborted (per the TPC-C spec). But in another way it's not surprising.
It's totally consistent with what I've been saying about open and
closed pages. When a page that's initially allocated and opened by a
backend becomes closed following some inserts, the page should be left
in a more or less pristine state -- owning backends that open and
close pages need to be "good citizens".

The definition of "pristine" must also include "no already-aborted
tuples" -- aborted xact tuples are a problem that we can and should
nip in the bud. This is really an extension of what I've been saying
about not disrupting naturally occurring temporal and spatial locality.
We're far better off if the space reused by freeing aborted heap
tuples is reused by the very next transaction running in the very same
backend. Plus it's not that hard; we can know for sure that this space
is immediately available, which is not generally true of pruning. And
so this free space from aborted xact tuples is quite special.

Even with the master branch the problems with the FSM are much less
noticeable once you configure BenchmarkSQL to not abort any
transactions [1]. You need to consider the whole picture to understand
how the issue of aborted transactions has such an outsized negative
impact on performance and space efficiency.

It's a whole *chain* of events:

* If just 1% of all transactions abort, that's enough to leave an
average of about 1 aborted tuple on every "order" table page (which
naturally has small heap tuples [2]), and about 10 on a minority of
the really big "new order" table's pages. Maybe 1 in 10 or 12 "order
line" table heap pages are affected in this way.
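The arithmetic behind the bullet above is simple expected-value math. The tuples-per-page figures here are my rough assumptions for an 8kB page, not measurements from the benchmark:

```python
# Back-of-envelope check: expected aborted tuples left behind per heap
# page filled by inserts. Row-density numbers are assumptions, not
# measurements.
ABORT_RATE = 0.01        # 1% of transactions abort, per the TPC-C spec

def aborted_tuples_per_page(tuples_per_page, abort_rate=ABORT_RATE):
    """Expected count of aborted tuples on a page filled by inserts."""
    return tuples_per_page * abort_rate

order_rows_per_page = 100        # "order" rows are small
order_line_rows_per_page = 10    # "order line" rows are larger

print(aborted_tuples_per_page(order_rows_per_page))       # roughly 1 per page
print(aborted_tuples_per_page(order_line_rows_per_page))  # roughly 1 per 10 pages
```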

* The overall impact is that small (but not tiny) pockets of free
space are left randomly strewn all over the place. Not just LP_DEAD
line pointer stubs -- whole entire aborted heap tuples.

* In theory we could address this problem using opportunistic pruning.
But in practice that simply cannot happen today, because we don't
touch the inserted rows until literally hours later.

The same rows will eventually have updates and selects, but that won't
help -- too little, too late. That's just how opportunistic pruning
works: currently you need some kind of scan node for opportunistic
pruning to kick in, so continually inserting transactions (from new
orders) are currently fundamentally unable to do any opportunistic
pruning. Much less opportunistic pruning that kicks in at exactly the
right time.

* When an autovacuum worker runs against these tables, it will of
course notice this diffuse free space from aborted tuples, and put it
in the FSM.

That free space will be reused, but by totally unrelated logical rows.

What we really seem to need here is some rudimentary form of "eager
physical rollback"; the really important thing seems to be to not
allow the problem to happen in the first place. I have a basic
prototype implementation, built on top of the larger prototype patch.
It opportunistically prunes-away aborted heap tuples on target/open
pages, bypassing the usual pd_prune_xid check we have in
heap_page_prune_opt() -- that doesn't make sense with the new form of
targeted, abort-oriented pruning. The new pruning mechanism makes a
really big difference on its own, without even involving the FSM. The
best cure for all kinds of bloat is prevention, which this comes
pretty close to.
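A minimal sketch of the idea, as a simplified model of my own (not the actual prototype): at abort time, the owning backend prunes its own aborted tuples from the page it still has open, without consulting any per-page threshold first.

```python
# Simplified model of "eager physical rollback" on a backend's open page.
# Names (HeapPage, prune_aborted) are illustrative, not the prototype's.

HEAP_PAGE_SIZE = 8192

class HeapPage:
    def __init__(self):
        self.free = HEAP_PAGE_SIZE
        self.line_pointers = []   # [xid, length, state] entries

    def insert(self, xid, length):
        self.line_pointers.append([xid, length, "NORMAL"])
        self.free -= length

def prune_aborted(page, aborted_xid):
    """Prune an aborted xact's tuples from an open page, unconditionally.

    Ordinary opportunistic pruning (heap_page_prune_opt) first checks a
    per-page hint (pd_prune_xid); abort-oriented pruning skips that check
    because the owning backend already knows exactly what died."""
    reclaimed = 0
    for lp in page.line_pointers:
        if lp[0] == aborted_xid and lp[2] == "NORMAL":
            lp[2] = "UNUSED"      # whole tuple gone, not an LP_DEAD stub
            reclaimed += lp[1]
    page.free += reclaimed
    return reclaimed

page = HeapPage()
page.insert(xid=7, length=200)
page.insert(xid=8, length=200)
reclaimed = prune_aborted(page, aborted_xid=8)
```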

I think that this general idea can be pushed further. I've talked
about aborts, but what about the commit path? I don't see why we can't
make inserting backends responsible for setting hint bits on recently
inserted rows -- the cost of putting off that work can only go up, no
matter what (unless maybe the user drops the table). Again, the
missing piece seems to be a general sense of ownership of heap pages
by transactions and/or backends. In order for a backend to be able to
clean up its own mess after itself while it's still cheap, it has to
understand what that actually means.
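The commit-path idea can be sketched the same way. Again a toy model under my own assumptions, not a proposal for specific interfaces: the inserting backend remembers its recent inserts and sets their xmin-committed hint at commit time, while the pages are still hot, instead of leaving that work to some later reader.

```python
# Toy model of an inserting backend setting xmin-committed hint bits on
# its own recent inserts at commit time. Illustrative only; not the
# actual heap tuple header or any proposed API.

class Tuple:
    def __init__(self, xid):
        self.xmin = xid
        self.hint_xmin_committed = False

class Backend:
    def __init__(self, xid):
        self.xid = xid
        self.recent_inserts = []   # tuples this transaction wrote

    def insert(self, page):
        t = Tuple(self.xid)
        page.append(t)
        self.recent_inserts.append(t)

    def commit(self):
        # The cost of deferring this only grows as the pages cool off,
        # so do it now, while the backend still "owns" the pages it
        # recently dirtied.
        for t in self.recent_inserts:
            t.hint_xmin_committed = True
        self.recent_inserts.clear()

page = []
backend = Backend(xid=42)
backend.insert(page)
backend.insert(page)
backend.commit()
```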

Don't forget that the TPC-C workload doesn't have that many updates
(at least not for the tables I'm focused on here). There is nothing
here that we tend to think of as an adversarial case for Postgres,
except perhaps the updates against the "order" table, which are always
non-HOT updates due to an issue with the indexes. That's the extent of
it, though -- every other table easily has 99%+ of all updates use HOT
though -- every other table easily has 99%+ of all updates use HOT
(albeit with some heap fill factor tuning). We should *expect* long
term stability here, based on the particulars of the workload; we only
need to update every "order" row and every "order lines" row exactly
once. After that they can just age out of shared_buffers forever
(they might be read again, but never modified again, and
reads were never the problem). The big "order line" table is by far
the biggest problem, even though it manages to use HOT for practically
all updates.

[1] https://github.com/wieck/benchmarksql/commit/66b8db073545026dc76ef513d2b0e318d2f3d5a2
[2] https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf -- "Table 1:
Summary of Logical Database"
--
Peter Geoghegan
