Re: The Free Space Map: Problems and Opportunities

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Jan Wieck <jan(at)wi3ck(dot)info>, gregsmithpgsql(at)gmail(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: The Free Space Map: Problems and Opportunities
Date: 2021-08-17 22:11:11
Message-ID: CAH2-WzmP-RH8Xfk7bY9OXu9BgL3mtyVWGS3dsddeOa8ThRvQ8Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 17, 2021 at 6:18 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I suspect that one other fairly fundamental issue is that we are very
> reticent about updating the FSM with information about free space. As long as
> that's the case a smarter placement logic would often not have a better
> place to choose from.

I think so too. But that is made a lot harder by the supposed need to
represent the amount of free space in FSM_CATEGORIES (255) distinct
increments, as opposed to maybe 4 distinct increments per page, as in
other systems. With coarse increments you only have to update the
externally stored metadata when a threshold is crossed.

(This isn't the main reason why I don't like that; more on that later.)
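To make the threshold-crossing idea concrete, here is a rough
standalone sketch in C (the 4 categories, their cutoffs, and all of
the names are invented for illustration -- this is not PostgreSQL
code, just the shape of the idea): the externally stored map is only
written when a page moves from one coarse bucket to another.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

/* Hypothetical 4-way bucketing, in place of 255 fine-grained categories */
typedef enum
{
    SPC_CLOSED = 0,             /* effectively full, skip for new inserts */
    SPC_LOW,                    /* < 25% free */
    SPC_MEDIUM,                 /* < 50% free */
    SPC_OPEN                    /* >= 50% free, preferred insert target */
} SpaceCategory;

static SpaceCategory
category_for(uint32_t free_bytes)
{
    if (free_bytes < BLCKSZ / 8)
        return SPC_CLOSED;
    if (free_bytes < BLCKSZ / 4)
        return SPC_LOW;
    if (free_bytes < BLCKSZ / 2)
        return SPC_MEDIUM;
    return SPC_OPEN;
}

/*
 * Only touch the external map when the page crosses a category boundary.
 * Returns true when an update was actually issued.
 */
static bool
maybe_update_map(SpaceCategory *stored, uint32_t new_free_bytes)
{
    SpaceCategory newcat = category_for(new_free_bytes);

    if (newcat == *stored)
        return false;           /* same bucket: no external update needed */
    *stored = newcat;           /* stand-in for the real FSM update */
    return true;
}

int
main(void)
{
    SpaceCategory stored = SPC_OPEN;
    uint32_t      free_bytes = BLCKSZ;

    /* Simulate inserting 60 tuples of ~120 bytes each into one page */
    for (int i = 0; i < 60; i++)
    {
        free_bytes -= 120;
        if (maybe_update_map(&stored, free_bytes))
            printf("tuple %d: map updated, category now %d\n", i, (int) stored);
    }
    return 0;
}

With buckets this coarse, keeping the map current across 60 successive
inserts into the same page costs only 3 external updates in this
simulation, which is what makes updating it eagerly plausible at all.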

> I think the design of the current bulk extension mechanism is quite backward
> (it only takes contention into account not other needs for bulk work, it does
> expensive stuff like finding victim buffers under lock, it doesn't release
> locks until all blocks are extended, ...). But: I don't think a bulk
> extension mechanism should be tasked with doing grouping or anything like
> that.

Maybe I'm just drawing the boundaries differently in my head, so this
doesn't seem like a real difference of opinion.

> > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> > ever end up inserting *any* new logical rows on a heap page after the
> > page first fills -- it is initially "open" when first allocated, and
> > then quickly becomes "closed" to inserts of new logical rows, once it
> > crosses a certain almost-empty space threshold.

> I suspect that we'd get a *LOT* of complaints if we introduced that degree of
> stickiness.

Maybe. It would be very different from our current behavior, which
would undoubtedly have many consequences. Maybe this will be a big
problem, maybe it barely matters. I make no specific claims about it,
either way. Not just yet.
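For readers unfamiliar with the FSIP idea being referenced, here is a
rough sketch of the sort of stickiness involved (thresholds and names
are invented for illustration; this is not DB2's actual behavior or
PostgreSQL code): a page closes the first time it fills past a
high-water mark, and only reopens once it is almost empty again, which
gives hysteresis rather than ping-ponging in and out of the set of
insert targets.

#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192

/*
 * Hypothetical sticky open/closed state for a heap page, loosely modeled
 * on the FSIP behavior described above.  Thresholds are made up.
 */
typedef struct
{
    uint32_t free_bytes;
    bool     closed;            /* closed to inserts of *new* logical rows */
} PageState;

/* The page closes the first time free space drops below 20% ... */
#define CLOSE_BELOW   (BLCKSZ / 5)
/* ... and only reopens once it is almost empty again (>= 80% free) */
#define REOPEN_ABOVE  (BLCKSZ * 4 / 5)

void
note_space_change(PageState *p, uint32_t new_free_bytes)
{
    p->free_bytes = new_free_bytes;

    if (!p->closed && new_free_bytes < CLOSE_BELOW)
        p->closed = true;       /* sticky: later small deletions don't reopen it */
    else if (p->closed && new_free_bytes >= REOPEN_ABOVE)
        p->closed = false;      /* only bulk deletions make it a target again */
}

/* Would an insert of a brand-new row consider this page at all? */
bool
page_accepts_new_rows(const PageState *p, uint32_t needed)
{
    return !p->closed && p->free_bytes >= needed;
}

The point of the two widely separated thresholds is exactly the
stickiness Andres is worried about: small amounts of reclaimed space
never put the page back in play.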

> Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
> from a fairly granular FSM_CATEGORIES, while still achieving better
> locality. You could e.g. search for free space for an out-of-page update in
> nearby pages with a lower free-percentage threshold, while having a different
> threshold and selection criteria for placement of inserts.

Again, it's all about *systemic* effects. An FSM request is basically
a choice *among* blocks that could in principle satisfy the request --
often the number of basically-satisfactory blocks is huge (way too
large, even). You have to bear in mind that concurrent requests are in
competition with each other much of the time. They are usually all
looking for exactly the same thing (e.g., the same free space
requirement), or close to the same thing -- tuples within a given
table tend to all be about as wide as each other.

I sometimes see clear cases of very high contention, where backends do
way too much spinning inside the RelationGetBufferForTuple() loop.
They're all more or less chasing the same scraps of free space, in a
highly inefficient way, even though there is probably an abundance of
free space elsewhere. Right now the heap pages that have the most free
space are also the ones that are least likely to be used.

Though I think that these backends should cooperate more, some amount
of competition is probably necessary. If FSM_CATEGORIES used 3 bits
instead of 8, then the backends could fall back on caring about some
other thing that distinguishes heap pages from each other and actually
matters -- leading to less contention, and maybe better space
utilization.

I have presented this FSM_CATEGORIES contention issue as a problem
distinct from the first one (the locality problem), though honestly
that was a bit arbitrary -- the line between them is blurry. Thank you
for putting up with that.

> The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
> merging FSM and VM is a good direction to go to. Additionally we need to make
> sure that the VM is accurately durable, which isn't the case for the FSM
> (which I think we should use to maintain it more accurately).

I explained this part badly. I don't think that we should literally
unite the VM and FSM data structures. I was really just pointing out
that a logical-ish notion of a page being full (to use the term of art
I've seen, a "closed" page) is almost the same thing as a page that is
marked all-visible/all-frozen. Presumably VACUUM does that with the
optimistic assumption that it will probably never again need to touch
the same page. And while that can never be guaranteed, it certainly
seems like we might want to weigh things in favor of that happening.

I'm pretty sure that carelessly remembering 200 bytes of free space in
the FSM for an all-frozen page is practically guaranteed to be a bad
idea. While I don't have a great idea of where the break-even point is
just yet, I'm sure that it's not 0 bytes, which is how it works today.
A squishy logical definition of "page is full" seems pretty natural to
me.

> But perhaps I'm just missing something here. What is the strong interlink
> between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
> that the main linkage is on the side of inserts/update that choose a target
> page from the FSM. Isn't it better to make that side smarter?

I'll respond to Robert about this separately.

> To me this seems like it'd be better addressed by a shared, per-relfilenode,
> in-memory datastructure. Thomas Munro has been working on keeping accurate
> per-relfilenode relation size information. ISTM that that'd be a better
> place to hook in for this.

Actually, that's the approach that I'm taking in my current
prototyping. I would like to have a WAL-logged component to this as
well, but I haven't got that far.

I think that it's possible that I'll eventually conclude that the
whole idea of a FSM is just the wrong idea. Remembering anything at
all about most pages may just be unnecessary in a world where we fix
everything. Maybe the end result is that the common case is that most
individual heap pages are "closed". Perhaps that can just be implicit
-- there is no information to remember.

> > would be quite natural for the leader backend to say to each individual
> > follower backend: thank you for waiting for me, here is a big contiguous
> > range of heap pages (e.g., as many as 50 heap pages or more per xact when
> > things truly ramp up), which should be enough to allow you to avoid going to
> > the FSM again for quite a long time.
>
> That'd end up badly if we did for a relation that's hotly inserted into by a
> lot of connections, but where each connection only inserts a few rows. I
> suspect we'd either need information from the inserting backends about the
> number of pages they're likely going to need (e.g. a smarter "extension queue"
> where each entry lists the number of pages requested) and/or stats about the
> number of consecutive insertions a relation typically gets.

That's a part of it, certainly. The heuristics would mostly be driven
by recent information. In particular, we would probably only ramp
things up based on a clear observed failure to keep up with demand. We
could be very aggressive indeed this way, without too much extra risk.
We'd only ramp up because we *knew* that the most recent batch/set of
free lists for the last bulk request was inadequate.

For that you need the context, the recent history.
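A rough sketch of the kind of feedback loop I have in mind (every
number and name here is invented for illustration): grow the batch
size only when the previous batch was observed to run out too quickly,
and back off otherwise.

#include <stdint.h>
#include <time.h>

/*
 * Hypothetical per-relation ramp-up state.  The policy knobs are invented
 * for the sketch; only the shape of the feedback loop matters here.
 */
typedef struct
{
    uint32_t last_batch_pages;   /* size of the previous bulk extension */
    time_t   last_batch_time;    /* when it was handed out */
} ExtendHistory;

#define MIN_BATCH      8
#define MAX_BATCH      512
#define FAST_EXHAUST_S 5         /* "used up too quickly" cutoff, seconds */

uint32_t
next_batch_size(ExtendHistory *h, time_t now)
{
    uint32_t next;

    if (h->last_batch_pages == 0)
        next = MIN_BATCH;                    /* no history yet: start small */
    else if (now - h->last_batch_time <= FAST_EXHAUST_S)
        next = h->last_batch_pages * 2;      /* demand outran supply: ramp up */
    else
        next = h->last_batch_pages / 2;      /* demand fell off: back down */

    if (next < MIN_BATCH)
        next = MIN_BATCH;
    if (next > MAX_BATCH)
        next = MAX_BATCH;

    h->last_batch_pages = next;
    h->last_batch_time = now;
    return next;
}

The details matter much less than the shape: the decision is driven by
what just happened with the previous batch, not by a static
per-relation setting.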

> > But the leader backend doesn't do that at all -- it just puts all of
> > the pages (up to 512 total) in the FSM. All of the waiting follower
> > backends are left to senselessly fight it out for free space by simply
> > going back to the FSM when they wake up again. Why can't they
> > cooperate intelligently? What stops that today?
>
> The lack of suitable in-memory coordination facilities.
>
> It turns out that I looked a bunch at this area because of AIO + DIO: With DIO
> the typical linux filesystems (ext4, xfs) do not perform any delayed
> allocation, which means that the non-bulk extension leads to *horrible*
> filesystem fragmentation.

It's the coordination facilities -- agreed. The only way that I seem
to be going further than what you've said here is the logical database
stuff. That is, I intend to specifically involve metadata like XIDs,
as well as a general understanding of things like commit/abort
boundaries for said XIDs. Ownership of a free space lease shouldn't be
strictly tied to an XID, but that seems like a good starting point to
me. This is the kind of thing that an OS engineer would never think
of, but that other database systems seem to do a lot of.

This design involving waiting on XIDs could also be reused for nbtree
page deletion, where page recycling is deferred (same with GiST page
deletion). We could model this problem as: the
physically-deleted-by-VACUUM page's safexid value should be treated as
"the maybe-still-ongoing transaction that owns this page". This is
kind of a lie, but close enough to the truth. That way we can put
every deleted page in the FSM/whatever immediately, while reusing the
same XID-wait stuff to solve the recycle safety problem.
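To make that framing concrete, here is a standalone sketch (all names
are hypothetical, and the "is it safe yet" check is a stub standing in
for the existing snapshot/procarray machinery -- this is not nbtree's
actual recycling code): the deleted page is advertised immediately,
tagged with its safexid, and it's the would-be consumer that decides
whether the page is recyclable yet.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint64_t FullTransactionId;

/*
 * Sketch: an entry advertising a deleted index page for reuse.  The page's
 * safexid is treated as if it were a maybe-still-running transaction that
 * "owns" the free space.
 */
typedef struct
{
    BlockNumber       blkno;
    FullTransactionId safexid;   /* recorded when VACUUM deleted the page */
} FreePageLease;

/*
 * Stand-in for the real visibility machinery: true once no snapshot could
 * still consider xid to be running.  A real implementation would consult
 * the existing procarray/snapshot infrastructure, not a bare variable.
 */
static FullTransactionId oldest_possibly_running = 1000;

static bool
xid_finished_to_everyone(FullTransactionId xid)
{
    return xid < oldest_possibly_running;
}

/*
 * Consumer side: may this backend recycle the page right now?  The producer
 * (VACUUM) never waits; it just records the lease immediately on deletion.
 */
bool
lease_is_recyclable(const FreePageLease *lease)
{
    return xid_finished_to_everyone(lease->safexid);
}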

We push the recycle safety stuff on to the consumer side this way.
This is currently a problem that mostly belongs to VACUUM itself (the
producer side) -- which makes zero sense. I think that the only reason
we do it that way is because the current FSM doesn't care at all about
XID boundaries. It's almost the same problem, maybe even exactly the
same.

> I addressed (or rather sidestepped) this to a small degree in the AIO+DIO
> patchset, by making extension faster and doing a few improvements around
> locking. But it's not good enough.
>
> If the relation extension infrastructure instead had information about the
> number of blocks each waiter wants and we had Thomas' "accurate shared
> relation size" infrastructure, we could do a lot better:

Sounds great to me!

> 1) Whenever a backend needs new space and most of the previously bulk-extended
> space is used opportunistically try to bulk-extend. In the happy path this
> avoids the thundering herd issue because other backends can continue
> inserting while extension is in progress.

Right -- my thoughts exactly.

> 2) Whenever bulk extending, release pages to individual backends by looking in
> the queue of backends needing extension and wake up backends individually
> whenever we've bulk extended sufficiently for that backend.

This is also exactly what I was thinking of, albeit for slightly
different reasons.

I want to more or less say directly to each backend at the point it
wakes up: here you go, here are those pages you asked for. Somebody
might check in about this free space lease later on, but in general
you can think of these pages as your exclusive property.
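Something like the following is roughly what I mean by a free space
lease here (a sketch with invented names, building on the "each queue
entry lists the number of pages requested" idea from upthread): the
leader carves each waiter's share out of one big extension and hands
it over directly, stamped with the recipient's XID.

#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint32_t TransactionId;

/*
 * Sketch of a per-backend free space lease: a contiguous range of newly
 * extended heap pages handed directly to one waiter by the leader, rather
 * than dumped into a shared structure for everyone to fight over.
 */
typedef struct
{
    TransactionId owner_xid;     /* who the pages were granted to */
    BlockNumber   first_block;   /* start of the contiguous range */
    uint32_t      nblocks;       /* how many pages were granted */
    uint32_t      next_unused;   /* backend-local cursor into the range */
} FreeSpaceLease;

/*
 * Leader side: carve the next waiter's requested share out of a single
 * bulk extension and advance the high-water mark of new blocks.
 */
FreeSpaceLease
grant_lease(TransactionId xid, BlockNumber *next_new_block,
            uint32_t npages_wanted)
{
    FreeSpaceLease lease;

    lease.owner_xid = xid;
    lease.first_block = *next_new_block;
    lease.nblocks = npages_wanted;
    lease.next_unused = 0;
    *next_new_block += npages_wanted;
    return lease;
}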

Making it backend/XID specific is probably helpful in many ways. Like
maybe there is a natural variation in demand among backends, even
though they're all bulk inserting into the same table. Who knows?

> I think the fundamental issue with bulk extension is more that bulk
> extension shouldn't actually get pages from the FSM at all. That's just an
> ugly implementation detail of the hack we have currently, rather than
> something that should live on.

Actually, I agree. I was just trying to use our existing terms to
introduce my ideas.

> We may want to decide to continue inserting
> bulk extended pages into the FSM to handle edge cases of backends erroring out
> before space is used, but the happy path should not touch the FSM at all.

You don't have to touch the FSM if you don't even have one in the
first place. :-)

> It sounds like you're thinking of using some global history, or at least a
> backend local history? I think it might be better to have backend local,
> per-command state. E.g. for a COPY aggressively increase the amount of space
> reserved, but not for a single-row insert.

That seems quite possible. Not sure just yet.

This feels like the right general idea to me, even though many of the
details are not worked out.

> I guess it depends on what you define as recent, but I don't see a way to
> bound the timeframe usefully. We need to deal with cases where table sizes go
> up and down over time, e.g. because there's periodic batch activity and
> ongoing activity. What's the time bound for unused space somewhere in a
> relation? I don't see any, unless we add compaction logic to vacuum.

I don't want to bound the timeframe, exactly. I want to look at what
happened during the last bulk operation, and compare that to what I
see for the current/ongoing operation -- which could involve a fair
amount of detail. The idea is to make decisions based on inferences
about what is *changing*. What's working, what's not working -- the
*direction* of things is at least as important as the state of things
at this exact point in time.

This does mean that you have to accept that some number of heap pages
can be "leaked" if inserts against a table that was being
bulk-extended suddenly go away completely. Of course it should be
possible to get this so-called leaked space back, just as soon as the
table receives more inserts (any inserts should do that). My current
best guess is that this will work fine in practice.

--
Peter Geoghegan
