Re: Problems with the FSM, heap fillfactor, and temporal locality

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Problems with the FSM, heap fillfactor, and temporal locality
Date: 2020-08-25 02:17:16
Message-ID: CAH2-Wz=QL1eNm+iOh8Eto3SDZCBnV7Qi0VL-cyuPTPkBRrMJdw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 24, 2020 at 6:38 AM John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> wrote:
> On Fri, Aug 21, 2020 at 8:53 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Note that there is a ~20% reduction in blks_hit here, even though the
> > patch does ~1% more transactions (the rate limiting doesn't work
> > perfectly). There is also a ~5.5% reduction in aggregate
> > blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
> > that's pretty significant.
>
> Indeed.

Most of this seems to come from the new_orders table, which has heap
pages that are continually inserted into and then deleted a little
later on. new_orders is a bit like a queue that never gets too big. It
is probably the component of TPC-C where we have the most room for
improvement, fragmentation-wise. OTOH, despite all the churn the high
watermark size of the new_orders table isn't all that high -- maybe
~50MB with a 1TB database. So it's not like we'll save very many
shared_buffers misses there.

> > Preferring close-by blocks in the FSM is something that there was
> > discussion of when the current FSM implementation went in back in 8.4.
>
> Right, I found a pretty long one here:
>
> https://www.postgresql.org/message-id/flat/1253201179.9666.174.camel%40ebony.2ndQuadrant

Thanks for finding that.

> I imagine you're suggesting to make this change in the FSM data? I'm
> thinking we could change the category byte to a signed integer, and
> reduce FSM_CATEGORIES to 128.

Yeah, something like that. I don't think that we need very many
distinct FSM_CATEGORIES. Even 128 seems like way more than could ever
be needed.
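
To make that concrete, here is a rough, untested sketch of what the
bytes-to-category mapping might look like with only 128 categories
(modeled on the existing fsm_space_avail_to_cat(); the SKETCH_*
constants are made up for illustration). The point is that the high bit
of each slot byte is then freed up for whatever extra per-slot state we
decide we need:

    #include "postgres.h"
    #include "storage/freespace.h"  /* for MaxFSMRequestSize */

    /* Hypothetical constants -- not the current freespace.c definitions */
    #define SKETCH_FSM_CATEGORIES   128
    #define SKETCH_FSM_CAT_STEP     (BLCKSZ / SKETCH_FSM_CATEGORIES)

    static uint8
    sketch_space_avail_to_cat(Size avail)
    {
        int         cat;

        Assert(avail < BLCKSZ);

        /* Top category is reserved for MaxFSMRequestSize bytes or more */
        if (avail >= MaxFSMRequestSize)
            return SKETCH_FSM_CATEGORIES - 1;

        cat = avail / SKETCH_FSM_CAT_STEP;

        if (cat > SKETCH_FSM_CATEGORIES - 2)
            cat = SKETCH_FSM_CATEGORIES - 2;

        return (uint8) cat;
    }

With 8KB blocks that still gives 64 byte granularity, which seems like
plenty.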

> Any change of the FSM file would require pg_upgrade to rewrite the
> FSM, but it still doesn't seem like a lot of code.

I think that the sloppy approach to locking for the
fsmpage->fp_next_slot field in functions like fsm_search_avail() (i.e.
not using real atomic ops, even though we could) is one source of
problems here. That might end up necessitating fixing the on-disk
format, just to get the FSM to behave sensibly -- assuming that the
value won't be too stale in practice is extremely dubious.

This fp_next_slot business interacts poorly with the "extend a
relation by multiple blocks" logic added by commit 719c84c1be5 --
concurrently inserting backends are liable to get the same heap block
from the FSM, causing "collisions". That almost seems like a bug IMV.
We really shouldn't be giving out the same block twice, but that's what
my own custom instrumentation shows happens here. With atomic ops, it
isn't a big deal to restart using a compare-and-swap at the end (when
we set/reset fp_next_slot for other backends).
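
Just to illustrate what I mean by real atomic ops (totally untested
sketch, not the current fsmpage.c code -- fp_next_slot would have to
become a pg_atomic_uint32, and "target" here is just the slot that the
backend actually ended up claiming):

    #include "port/atomics.h"

    static void
    sketch_advance_next_slot(pg_atomic_uint32 *fp_next_slot, uint32 target)
    {
        uint32      expected = pg_atomic_read_u32(fp_next_slot);

        /*
         * Only move the search hint forward past the slot we just
         * claimed; if another backend already advanced it further,
         * keep their value.
         */
        while (expected <= target)
        {
            if (pg_atomic_compare_exchange_u32(fp_next_slot, &expected,
                                               target + 1))
                break;
            /* on failure, "expected" holds the current value; retry */
        }
    }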

> Other ideas?

I've been experimenting with changing the way that we enforce heap
fill factor with calls to heap_insert() (and even heap_update()) that
happen to occur at a "natural temporal boundary". This works by
remembering an XID alongside the target block in the relcache when the
target block is set. When we have an existing target block whose XID
does not match our backend's current XID (i.e. it's an old XID for the
backend), then that means we're at one of these boundaries. We require
that the page has a little more free space before we'll insert on it
when at a boundary. If we barely have enough space to insert the
incoming heap tuple, and it's the first of a few tuples the
transaction will ultimately insert, then we should start early on a
new page instead of using the last little bit of space (note that the
"last little bit" of space does not include the space left behind by
fill factor). The overall effect is that groups of related tuples are
much less likely to span a heap page boundary unless and until we have
lots of updates -- though maybe not even then. I think that it's very
common for transactions to insert a group of 2 - 15 logically related
tuples into a table at a time.
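
Very roughly, the check when deciding whether to reuse the cached target
block looks something like this (hand-wavy sketch only; the
targblock_xid argument stands in for the XID we'd remember alongside the
target block in the relcache, and BOUNDARY_EXTRA_SPACE is a made-up name
for the extra headroom we'd require at a boundary):

    #include "postgres.h"
    #include "access/transam.h"     /* TransactionIdEquals */
    #include "access/xact.h"        /* GetCurrentTransactionId */
    #include "utils/rel.h"          /* RelationGetTargetBlock */

    #define BOUNDARY_EXTRA_SPACE    128     /* hypothetical slack, in bytes */

    static bool
    sketch_use_cached_target(Relation rel, TransactionId targblock_xid,
                             Size needed, Size pagefree)
    {
        BlockNumber targblock = RelationGetTargetBlock(rel);

        if (targblock == InvalidBlockNumber)
            return false;       /* no cached target block to reuse */

        if (!TransactionIdEquals(targblock_xid, GetCurrentTransactionId()))
        {
            /*
             * The target block was set by an earlier transaction, so
             * we're at a "natural temporal boundary" -- require a bit
             * of headroom beyond the bare tuple size before squeezing
             * the incoming tuple onto this page.
             */
            if (pagefree < needed + BOUNDARY_EXTRA_SPACE)
                return false;   /* start early on a new page instead */
        }

        return pagefree >= needed;
    }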

Roughly speaking, you can think of this as the heapam equivalent of
the nbtree page split choice logic added by commit fab25024. We ought
to go to at least a little bit of effort to minimize the number of
distinct XIDs that are present on each heap page (in the tuple
headers). We can probably figure out heuristics that result in
respecting heap fill factor on average, while giving inserts (and even
non-HOT updates) a little wiggle room when it comes to heap page
boundaries.

By applying both of these techniques together (locality/page split
thing and real atomic ops for fp_next_slot) the prototype patch I'm
working on mostly restores the system's current ability to reuse space
(as measured by the final size of relations when everything is done),
while maintaining most of the performance benefits of not using the
FSM at all. The code is still pretty rough, though.

I haven't decided how far to pursue this. It's not as if there are many
ways left to make TPC-C go 5%+ faster; it's very
write-heavy, and stresses many different parts of the system all at
once. I'm sure that anything like my current prototype patch will be
controversial, though. Maybe it will be acceptable if we only change
the behavior for people who explicitly set heap fillfactor.

--
Peter Geoghegan
