Re: The Free Space Map: Problems and Opportunities

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Jan Wieck <jan(at)wi3ck(dot)info>, Gregory Smith <gregsmithpgsql(at)gmail(dot)com>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: The Free Space Map: Problems and Opportunities
Date: 2021-08-23 21:54:51
Message-ID: CAH2-Wz=zEV4y_wxh-A_EvKxeAoCMdquYMHABEh_kZO1rk3a-gw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 20, 2021 at 1:30 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Aug 20, 2021 at 3:40 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > My concern here is really the data structure and its overall
> > complexity.

> I don't think I understand the data structure that you have in mind
> well enough to comment intelligently.

Right now my prototype has a centralized table in shared memory, built
around a hash table: one entry per relation, generally multiple
freelists per relation, with per-freelist metadata such as owner and
original leader backend XID values. Plus of course the lists of free blocks
themselves. The prototype already clearly fixes the worst problems
with BenchmarkSQL, but that's only one of my goals. That's just the
starting point.
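To make the shape of that structure concrete, here is a minimal sketch in plain C. This is purely illustrative and not taken from the prototype; all of the names (FreeSpaceLease, RelFreeSpaceEntry, the array sizes) are hypothetical, and a real implementation would live in a dynahash-style shared memory hash table rather than fixed arrays:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;
typedef uint32_t BlockNumber;
typedef uint32_t Oid;

#define MAX_LEASE_BLOCKS 32

/*
 * One free space lease (hypothetical layout): a short list of free
 * blocks owned by one backend, tagged with the XID of the transaction
 * that acquired it, so ownership is transaction-scoped.
 */
typedef struct FreeSpaceLease
{
    TransactionId owner_xid;              /* transaction holding the lease */
    int           nblocks;                /* blocks remaining in the lease */
    BlockNumber   blocks[MAX_LEASE_BLOCKS];
} FreeSpaceLease;

/*
 * One hash table entry per relation, keyed by relation OID, with
 * multiple freelists/leases hanging off each entry.
 */
typedef struct RelFreeSpaceEntry
{
    Oid            relid;                 /* hash key */
    int            nleases;
    FreeSpaceLease leases[8];
} RelFreeSpaceEntry;
```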

I appreciate your input on this. And not just on the implementation
details -- the actual requirements themselves are still in flux. This
isn't so much a project to replace the FSM as it is a project that
adds a new rich abstraction layer that goes between access methods and
smgr.c -- free space management is only one of the responsibilities.

Access methods like heapam can talk to the new abstraction layer,
sometimes exploiting semantics from the logical database -- owning
transaction IDs for a given lease, things like that. The new
abstraction may thereby help us when working on other problems,
including problems in areas that might seem totally unrelated. That's
the big idea. Here are some (admittedly quite speculative) examples of
places that the infrastructure could help with:

* Parallel INSERT/COPY.

I imagine parallel DML is hard because of coordination problems. With
the infrastructure I have in mind, a parallel DML implementation could
request a large free space lease, and distribute blocks from that
lease among parallel workers during execution, on demand. This
naturally avoids unnecessary buffer lock contention among concurrent
writes from the same transaction (but different workers).
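A toy sketch of the distribution step, under the assumption (mine, not from the prototype) that a lease covers a contiguous block range that workers drain one block at a time. Because each block is handed out exactly once, no two workers ever touch the same block:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Hypothetical: the leader pre-acquires a lease covering blocks
 * [start, start + nblocks); workers pull blocks on demand. A real
 * version would use an atomic fetch-add on 'next' in shared memory.
 */
typedef struct ParallelLease
{
    BlockNumber start;
    int         nblocks;
    int         next;       /* index of next undistributed block */
} ParallelLease;

static BlockNumber
LeaseNextBlock(ParallelLease *lease)
{
    if (lease->next >= lease->nblocks)
        return InvalidBlockNumber;  /* exhausted; request a new lease */
    return lease->start + lease->next++;
}
```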

But that's not all. I'm also thinking of things like LWLock/buffer
lock deadlock avoidance -- issues of fundamental correctness. That
stuff becomes easier, too. Individual parallel workers from a parallel
INSERT could safely assume that the blocks that the parallel leader
tells them about really are only going to be accessed by them, for as
long as the owning transaction is around (barring read-only access by
VACUUM, things like that). There is simply no question of any other
backend/parallel worker thinking that it has the right to use the same
block before the transaction commits.

The transaction-scoped ownership semantics must be truly robust for
this to work at all, but that's a good idea anyway.

* Truncate tables without a disruptive relation-level lock.

AFAICT it isn't actually logically necessary to get a heavyweight
relation extension lock to truncate a relation during VACUUM. The only
reason we need that right now is because it's really hard to nail down
which backends might have a usable BlockNumber that they still expect
to not point past the end of the physical relfilenode (or much worse,
point to new and logically unrelated data). Perhaps we can solve the
underlying problem using something like Lanin & Shasha's "drain
technique", which is used inside nbtree for page deletion today.

The general idea is to defer the physical unlink/recycling until we
know for sure that any backends that might have had unsafe references
have gone away. This is a very broad technique.
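One way to express the deferral: stamp the pending truncation with a monotonically advancing counter, and only recycle once every backend that predates the stamp has passed a safe point. This is my own illustrative framing of the drain technique, not nbtree's actual mechanism (which uses a safe-xid check during page deletion); all names here are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DrainEpoch;

/*
 * Hypothetical drain bookkeeping: logical truncation records the
 * current epoch; the physical unlink is deferred until no backend
 * that was active at (or before) that epoch remains, so no stale
 * BlockNumber can point past the new end of the relfilenode.
 */
typedef struct PendingTruncate
{
    DrainEpoch stamped_epoch;   /* epoch when logical truncation happened */
} PendingTruncate;

static bool
TruncateIsRecyclable(const PendingTruncate *pt, DrainEpoch oldest_active)
{
    /* Safe once every backend predating the truncation has drained. */
    return oldest_active > pt->stamped_epoch;
}
```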

* Truncate-away empty heap table segments that happen to come before
other segments that still have useful data.

* Eager cleanup of aborted transactions by VACUUM (or some variation
of VACUUM that just handles aborts).

Again, we're using information about the logical database to solve
problems with the physical database -- that information is what we
currently seem to be missing in all these areas. That's really the central
idea behind the new infrastructure.

--
Peter Geoghegan
