Re: The Free Space Map: Problems and Opportunities

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hannu Krosing <hannuk(at)google(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Jan Wieck <jan(at)wi3ck(dot)info>, Gregory Smith <gregsmithpgsql(at)gmail(dot)com>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: The Free Space Map: Problems and Opportunities
Date: 2021-09-08 01:37:30
Message-ID: CAH2-Wz=sSvMX5HWo8d=DFznid0BiTySCm-zhHUsnC4UP-OaXbA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 7, 2021 at 12:31 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Doing work in the background has some advantages, though. In
> particular, it has the possibly-large advantage of not slowing down
> foreground work.

What I really like about the idea of doing the work in the foreground
(when it's clearly not going to be too much of a latency hit) is that
it would act as a natural form of backpressure. Sometimes slowing
things down is a good thing -- fundamentally inefficient and
unsustainable use of the system's resources ought to be slowed down by
imposing the cost on the transaction that performs the relevant insert
queries. To me this seems like a problem of negative externalities.
With small transactions we can pretty much know for sure that the
benefit of not leaving pages in good shape at the moment they become
"closed" is fixed and low, while the cost to the system as a whole
over time is potentially far higher. It seems almost unbounded, in a
way, so we're bound to lose in the long run, regardless of workload.
(The work can be piggybacked across transactions to make it more
efficient, but that's mostly an implementation detail.)

Updaters might actually be much less of a problem than inserters. They
already keep costs down by pruning opportunistically -- they at least
try. Most individual logical rows that are inserted have exactly zero
chance of ever being updated, but a ~100% chance of needing to have
hint bits set sooner or later. We know very little about workload
characteristics, but these are some of the few things that we really
do know for sure.

> For me the key insight here is that HOT-pruning a heap page is a lot
> cheaper if you do it before you write the page. Once you've written
> the page, the eventual HOT-prune is going to need to dirty it, which
> will cause it to be written again. If you prune before writing it the
> first time, that cost is avoided.

Right. The difficulty with applying that insight before now has been
making the prune operation take place at the level of whole pages
systematically. I believe that it's quite valuable to set all the hint
bits on a page as we "close it out" for the first time. But setting
hint bits for all of the heap tuples on the page except one or two is
pretty much useless. You might as well not bother at all.
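
To make the shape of that concrete, here's a minimal toy model of the
idea (none of this is PostgreSQL's real API -- ToyTuple, ToyPage,
page_close_out() and xid_did_commit() are just illustrative stand-ins
for the tuple header hint bits, the heap page, the close-out step, and
a clog lookup such as TransactionIdDidCommit()):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;

typedef struct ToyTuple
{
    xid_t xmin;             /* inserting transaction */
    bool  hint_committed;   /* analogous to HEAP_XMIN_COMMITTED */
    bool  hint_aborted;     /* analogous to HEAP_XMIN_INVALID */
    bool  dead;             /* pruned; space can be reused */
} ToyTuple;

typedef struct ToyPage
{
    ToyTuple tuples[64];
    int      ntuples;
    bool     closed;        /* no further inserts expected */
} ToyPage;

/* Stand-in for a commit log lookup; real code consults the clog. */
static bool
xid_did_commit(xid_t xid)
{
    (void) xid;
    return true;            /* placeholder */
}

/*
 * "Close out" a heap page: resolve hint bits for every tuple and prune
 * aborted tuples in one pass, so the page reaches its final state
 * before its first write.  Resolving all but one or two tuples would
 * dirty the page again later anyway, which is why it's all-or-nothing.
 */
static void
page_close_out(ToyPage *page)
{
    for (int i = 0; i < page->ntuples; i++)
    {
        ToyTuple *tup = &page->tuples[i];

        if (tup->hint_committed || tup->hint_aborted)
            continue;       /* already resolved */

        if (xid_did_commit(tup->xmin))
            tup->hint_committed = true;
        else
        {
            tup->hint_aborted = true;
            tup->dead = true;   /* prune now, not at next VACUUM */
        }
    }
    page->closed = true;
}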

It follows that concurrent inserters clobbering the same page
senselessly are more or less not okay if we're going to pursue this
technique. You might still have a concurrent update of one of the
inserted rows shortly after the original inserting transaction
commits, but before the page is closed -- that might make the idea not
work out for that one page (obviously it depends on the specifics).
That seems like something we can live with, though.

> I'm not sure that it really matters
> whether the space consumed by aborted tuples gets reused by "the very
> next transaction" or, say, 10 transactions after that, or even 1000
> transactions after that. If you wait for a million transactions,
> you've quite possibly lost enough temporality to matter, but at 10 or
> 1000 that's much less likely.

I didn't mean to suggest that it had to happen in perfect lockstep.
But I do think that it should be pretty close to perfect. What I
actually do right now is prune an open page when it *appears* to be
full inside the loop in RelationGetBufferForTuple(). As crude as that
is, it sets hint bits and prunes away aborted transactions in mostly
the right way, at least as far as TPC-C is concerned. (This only works
because it fits into the wider framework of my much larger patch,
which introduces ownership semantics, open and closed pages, empty
page freelists, etc.)
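
Very roughly, and only as an illustrative sketch (this is not the patch
itself; get_open_page(), page_has_room() and open_new_page() are
hypothetical stand-ins for the real machinery around
RelationGetBufferForTuple(), reusing the ToyPage/page_close_out() model
from the earlier sketch), the shape of it is:

extern ToyPage *get_open_page(void);
extern bool     page_has_room(ToyPage *page, int needed_bytes);
extern void     open_new_page(void);

static ToyPage *
page_for_insert(int needed_bytes)
{
    for (;;)
    {
        ToyPage *page = get_open_page();

        if (page_has_room(page, needed_bytes))
            return page;    /* the insert goes here */

        /*
         * The open page *appears* full: close it out now (set hint
         * bits, prune aborted tuples) while it is still in cache and
         * about to be written anyway, instead of dirtying it a second
         * time much later.
         */
        page_close_out(page);
        open_new_page();    /* next get_open_page() returns a fresh page */
    }
}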

> The exact threshold is fuzzy: every
> moment you wait makes it less likely that you have sufficient
> locality, but you can always find a workload where even a very long
> wait is acceptable, and another one where even a tiny delay is
> catastrophic, and it's hard to say what the "real world" looks like.

I agree that offloading work in the background can be very useful, and
that technique needs to be possible as future work (I actually really
want this myself). But most individual apps probably won't really
notice the latency hit from pruning if it's well managed -- as with
pruning by updates today. We can make a decision about which strategy
to use (lightweight foreground processing with a potential latency
hit, or background batch cleanup?) at a very late point in the cycle,
based on the specifics in each case. Offhand, I don't think that we
need to characterize this fuzzy threshold in too much detail. I
suspect that the main determinative question can be simple enough.
Something like: Does this transaction (or this group of transactions)
involve only trivial changes to maybe 3-5 heap pages at the most, or
not?
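
As a sketch of what that late decision might look like (the 5 page
threshold, the enum and the function name are all just illustrative
assumptions, not anything from an actual patch):

typedef enum CleanupStrategy
{
    CLEANUP_FOREGROUND,     /* prune/set hint bits before returning */
    CLEANUP_BACKGROUND      /* hand the pages off to a background process */
} CleanupStrategy;

static CleanupStrategy
choose_cleanup_strategy(int heap_pages_touched)
{
    /*
     * Trivial transactions: the latency hit of closing out a handful
     * of pages in the foreground is small, and it acts as backpressure.
     */
    if (heap_pages_touched <= 5)
        return CLEANUP_FOREGROUND;

    return CLEANUP_BACKGROUND;
}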

There is a separate question about what happens if our background
cleanup process cannot keep up over time. I think that we can handle
that by falling back on foreground processing (within reason). Let's
say we have transactions that each insert onto dozens or perhaps even
hundreds of heap pages -- these are "medium sized" write
transactions. If we're already doing way too much background
processing for similar transactions from earlier on, and if it's clear
we just can't keep up, maybe we can fall back and force foreground
processing, to keep the debt under control. (In general I think that
this back pressure, negative externalities business will be
important.)
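
Extending the sketch above, the fallback might look something like this
(the debt counter and the cap are, again, just illustrative):

static long cleanup_debt_pages = 0;         /* queued, not yet cleaned */
#define CLEANUP_DEBT_CAP    (1024L * 1024L) /* arbitrary example cap */

static CleanupStrategy
choose_cleanup_strategy_with_backpressure(int heap_pages_touched)
{
    if (heap_pages_touched <= 5)
        return CLEANUP_FOREGROUND;

    /*
     * Background cleanup can't keep up: impose the cost on the writing
     * transaction itself to keep the debt bounded.
     */
    if (cleanup_debt_pages > CLEANUP_DEBT_CAP)
        return CLEANUP_FOREGROUND;

    /* The background process would decrement this as it catches up. */
    cleanup_debt_pages += heap_pages_touched;
    return CLEANUP_BACKGROUND;
}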

It seems to me that this leaves one harder question unanswered: at
what point does a "medium sized" transaction become so large that it
just doesn't make sense to do either? What's the crossover point past
which neither background nor foreground processing like this is worth
it? That I won't speculate about just yet.
I suspect that at some point it really does make sense to leave it all
up to a true table-level batch operation, like a conventional VACUUM.

--
Peter Geoghegan
