Re: Problems with the FSM, heap fillfactor, and temporal locality

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Problems with the FSM, heap fillfactor, and temporal locality
Date: 2020-08-25 23:41:42
Message-ID: CAH2-WzmqdrjedPLXqSZXq4oSrcbCyZFHsHnQY72o5YPsHjiMvQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 25, 2020 at 6:21 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> This all definitely sounds quite interesting and the idea to look at the
> XID to see if we're in the same transaction and therefore likely
> inserting a related tuple certainly makes some sense. While I get that
> it might not specifically work with TPC-C, I'm wondering about if we
> could figure out how to make a multi-tuple INSERT use
> heap/table_multi_insert (which seems to only be used by COPY currently,
> and internally thanks to the recent work to use it for some catalog
> tables) and then consider the size of the entire set of tuples being
> INSERT'd when working to find a page, or deciding if we should extend
> the relation.

There are probably quite a variety of ways in which we can capture
locality, and I'm sure that I'm only scratching the surface right now.
I agree that table_multi_insert could definitely be one of them.
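
Just to sketch how that might look (nothing here is from a real patch
-- the InsertBatch struct and the insert_batch_* functions are invented
for illustration, while table_multi_insert(), table_slot_create(), and
the bulk insert state calls are the existing APIs): the executor could
buffer the slots produced by a multi-row INSERT and hand them to the
table AM in one call, so that the free space search and the decision to
extend the relation get to see the whole batch at once. Index
maintenance, triggers, and resource cleanup are glossed over entirely.

#include "postgres.h"

#include "access/heapam.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

#define INSERT_BATCH_SIZE 1000

typedef struct InsertBatch
{
    Relation    rel;
    TupleTableSlot *slots[INSERT_BATCH_SIZE];
    int         nslots;
    BulkInsertState bistate;
} InsertBatch;

static void
insert_batch_init(InsertBatch *batch, Relation rel)
{
    batch->rel = rel;
    batch->nslots = 0;
    batch->bistate = GetBulkInsertState();

    /* Batch-owned slots, created up front, as COPY does */
    for (int i = 0; i < INSERT_BATCH_SIZE; i++)
        batch->slots[i] = table_slot_create(rel, NULL);
}

static void
insert_batch_flush(InsertBatch *batch, CommandId cid)
{
    if (batch->nslots == 0)
        return;

    /* One call covers the whole batch, just like COPY */
    table_multi_insert(batch->rel, batch->slots, batch->nslots,
                       cid, 0, batch->bistate);

    for (int i = 0; i < batch->nslots; i++)
        ExecClearTuple(batch->slots[i]);
    batch->nslots = 0;
}

static void
insert_batch_add(InsertBatch *batch, TupleTableSlot *slot, CommandId cid)
{
    /* Copy into a batch-owned slot; the caller may reuse its slot */
    ExecCopySlot(batch->slots[batch->nslots++], slot);

    if (batch->nslots == INSERT_BATCH_SIZE)
        insert_batch_flush(batch, cid);
}

(FreeBulkInsertState() and dropping the slots would happen at the end
of the statement, and the final partial batch would be flushed there
too.)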

John said something about concentrating garbage in certain pages
up-thread. I also wonder if there is some visibility + freeze map
angle on this.

What I see with the benchmark is that the order_line table (the
largest table by quite some margin, and one that grows indefinitely)
does not make much use of the visibility map during VACUUM -- even
though it's the kind of table + workload that you'd hope and expect
would make good use of it if you were looking at it in a real world
situation. Each tuple is only inserted once and later updated once, so
we really ought to do better there. The logs show that
VACUUM/autovacuum dirties lots of pages, probably due to fragmentation
from free space management (though there are undoubtedly other
factors).

The best "locality orientated" reference guide to TPC-C that I've been
able to find is "A Modeling Study of the TPC-C Benchmark", which was
published in 1993 by NASA (shortly after the introduction of TPC-C).
You can get it from:

https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf (Unfortunately
this reproduction is a low quality photocopy -- ACM members can get a
clear copy.)

If you think about the TPC-C workload at a high level, and Postgres
internals stuff at a low level, and then run the benchmark, you'll
find various ways in which we don't live up to our potential. The big
picture is that the "physical database" is not close enough to the
"logical database", especially over time and after a lot of churn.
This causes problems all over the place that look like nothing in
particular in profiles.

It's not that TPC-C is unsympathetic to Postgres in any of the usual
ways -- there are very few UPDATEs that affect indexed columns, so
those are not really a factor at all. There are also no transactions
that run longer than 2 seconds (anything more than ~50ms per execution
is
exceptional, in fact). We already do a significant amount of the
necessary garbage collection opportunistically (by pruning) --
probably the vast majority, in fact. In particular, HOT pruning works
well, since fill factor has been tuned. It just doesn't work as well
as you'd hope, in that it cannot stem the tide of fragmentation. And
not just because of heapam's use of the FSM.

If we implemented a simple differential heap tuple compression scheme
within HOT chains (though not across HOT chains/unrelated tuples),
then we'd probably do much better -- we could keep the same logical
tuple on the same heap page much more often, maybe always. For
example, "stock" table is a major source of FPIs, and I bet that this
is greatly amplified by our failure to keep versions of the same
frequently updated tuple together. We can only fit ~25 stock tuples on
each heap page (with heap fill factor at 95, the BenchmarkSQL
default), so individual tuples are ~320 bytes (including tuple
header). If we found a way to store the changed columns for successor
tuples within a HOT chain, then we would do much better -- without
changing the HOT invariant (or anything else that's truly scary). If
our scheme worked by abusing the representation that we use for NULL
values in the successor/HOT tuples (which is not terribly space
efficient), then we could still store about 6 more versions of each
stock tuple on the same page -- the actual changed columns are
typically much much smaller than the unchanged columns. Our 23/24 byte
tuple header is usually small potatoes compared to storing unchanged
values several times.
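
To make the arithmetic concrete (a standalone back-of-the-envelope toy,
not Postgres code -- the 60 byte size of a delta version is just an
assumption, made up of the tuple header, a null bitmap, the handful of
columns that a TPC-C stock UPDATE actually changes, and a line
pointer):

#include <stdio.h>

int
main(void)
{
    const int   page_size = 8192;   /* BLCKSZ */
    const int   fillfactor = 95;    /* BenchmarkSQL default */
    const int   full_tuple = 320;   /* approx. stock tuple, incl. header */

    /* Space that fillfactor 95 deliberately leaves free */
    int         reserved = page_size * (100 - fillfactor) / 100;

    /*
     * Assumed delta version: 24 byte header + 8 byte bitmap +
     * ~24 bytes of changed columns + 4 byte line pointer
     */
    int         delta_version = 24 + 8 + 24 + 4;

    printf("full versions per page: ~%d\n",
           page_size * fillfactor / 100 / full_tuple);      /* ~24 */
    printf("reserved free space:    ~%d bytes\n", reserved); /* ~409 */
    printf("extra delta versions:   ~%d\n",
           reserved / delta_version);                        /* ~6 */

    return 0;
}

Even if the real delta versions turned out to be twice that size, we'd
still fit a few extra versions per page, which is all that
opportunistic pruning needs to keep up.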

As I said, the HOT optimization (and opportunistic pruning) already
work well with this benchmark. But I think that they'd work a lot
better if we could just temporarily absorb a few extra versions on the
heap page, so we have enough breathing room to prune before the page
"truly fills to capacity". It could help in roughly the same way that
deduplication now helps in nbtree indexes with "version churn".

I'm also reminded of the nbtree optimization I prototyped recently,
which more or less prevented all unique index bloat provided there is
no long running transaction:

https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com

(Yes, "preventing all unique index bloat provided there is no long
running transaction" is no exaggeration -- it really prevents all
bloat related nbtree page splits, even with hundreds of clients, skew,
etc.)

It seems pretty obvious to me that buying another (say) 2 seconds to
let opportunistic pruning run "before the real damage is done" can be
extremely valuable -- we only need to be able to delay a page split
(which is similar to the case where we cannot continue to store heap
tuples on the same heap page indefinitely) for a couple of seconds at
a time. We only need to "stay one step ahead" of the need to split the
page (or to migrate a logical heap tuple to a new heap page when it
comes to the heap) at any given time -- that alone will totally arrest
the problem.

This is a very important point -- the same set of principles that
helped in nbtree can also be effectively applied to heap pages that
are subject to version churn. (Assuming no long running xacts.)

> Getting a 5% improvement is pretty exciting, very cool and seems worth
> spending effort on.

I'm already at 5% - 7% now. I bet the differential compression of
tuples on a HOT chain could buy a lot more than that. The biggest
emphasis should be placed on stable performance over time, and total
I/O over time -- that's where we're weakest right now.

--
Peter Geoghegan
