Re: Deleting older versions in unique indexes to avoid page splits

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Victor Yegorov <vyegorov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date: 2021-01-20 18:53:17
Message-ID: CAH2-Wzkmz_wUfm09XmGoAtCHaxa-+AA5suVY_eY5p0psgi5zAA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 20, 2021 at 5:33 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > Victor independently came up with a benchmark that ran over several
> > hours, with cleanup consistently held back by ~5 minutes by a long
> > running transaction:
> >
>
> AFAICS, the long-running transaction used in the test is below:
> SELECT abalance, pg_sleep(300) FROM pgbench_accounts WHERE mtime >
> now() - INTERVAL '15min' ORDER BY aid LIMIT 1;
>
> This shouldn't generate a transaction id so would it be sufficient to
> hold back the clean-up?

It will hold back cleanup because it holds open a snapshot. Whether
or not the long-running transaction has been allocated a true XID (not
just a VXID) is irrelevant. Victor's test case is perfectly valid.
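
You can see this for yourself in pg_stat_activity while Victor's query
runs: backend_xid stays NULL (no true XID is ever assigned), but
backend_xmin is set, and that's what holds back the horizon. A rough
sketch of what I mean:

    SELECT pid, state, backend_xid, backend_xmin
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL;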

In general there are significant benefits for cases with long-running
transactions, which will be quite apparent if you do something simple
like run pgbench (a script with non-HOT updates) while a REPEATABLE
READ transaction runs in psql (and has actually acquired a snapshot by
running a simple query -- the xact snapshot is acquired lazily). I
understand that this will be surprising if you believe that the
problem in these cases is strictly that there are too many "recently
dead" versions that still need to be stored (i.e. versions that
cleanup simply isn't able to remove, given the xmin horizon
invariant). But it's now clear that that's not what really happens in
most cases with a long-running transaction. What we actually see is
that local page-level inefficiencies in cleanup were (and perhaps to
some degree still are) a much bigger problem than the inherent
inability of cleanup to remove even one or two tuples. This is at
least true until the bloat problem becomes a full-blown disaster
(because cleanup really is inherently restricted by the global xmin
horizon, and therefore hopelessly behind).
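
To be concrete, the psql side of that setup is nothing more than this
(a sketch -- any trivial query is enough to force snapshot acquisition):

    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT 1;   -- the snapshot is only taken once a query actually runs
    -- now leave the session idle while pgbench runs its non-HOT updates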

In reality there are seldom that many individual logical rows that get
updated more than a few times in (say) any given one hour period. This
is true even with skewed updates -- the skew is almost never visible
at the level of an individual leaf page. The behavior we see now that
we have bottom-up index deletion is much closer to the true optimal
behavior for the general approach Postgres takes to cleanup of garbage
tuples, since it tends to make the number of versions for any given
logical row as low as possible (within the confines of the global xmin
horizon limitations for cleanup).
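
If you want to see that effect directly, the pgstattuple extension is
one way -- compare leaf page counts and density for an affected index
with and without the patch. A rough sketch (the index name here is just
a placeholder):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT leaf_pages, avg_leaf_density, leaf_fragmentation
    FROM pgstatindex('some_affected_index');  -- placeholder index name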

Of course it would also be helpful to have something like zheap --
some mechanism that can store "recently dead" versions some place
where they at least don't bloat up the main relation structures. But
that's only one part of the big picture for Postgres MVCC garbage. We
should not store garbage tuples (i.e. those that are dead rather than
just recently dead) *anywhere*.

> First, it is not clear to me whether that properly simulates the
> long-running case, but even if it does, what I intended to say was to
> have an open long-running transaction, possibly for the entire
> duration of the test. If we do that, we will come to know if there is
> any overhead and, if so, how much.

I am confident that you are completely wrong about regressing cases
with long-running transactions, except perhaps in some very narrow
sense that is of little practical relevance. Victor's test case did
result in a small loss of throughput, for example, but that's a small
price to pay to not have your indexes explode (note also that most of
the indexes weren't even used for reads, so in the real world it would
probably also improve throughput even in the short term). FWIW the
small drop in TPS probably had little to do with the cost of visiting
the heap for visibility information. Workloads must be made to "live
within their means". You can call that a regression if you like, but I
think that almost anybody else would take issue with that
characterization.
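
A quick way to check that in any given workload is to look for indexes
that never get scanned -- roughly:

    SELECT relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0;

Anything that shows up there is pure write overhead.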

Slowing down non-HOT updaters in these extreme cases may actually be a
good thing, even when bottom-up deletion finally becomes ineffective.
It can be thought of as backpressure. I am not worried about slowing
down something that is already hopelessly inefficient and
unsustainable. I'd go even further than that, in fact -- I now wonder
if we should *deliberately* slow them down some more!

> Test with 2 un-modified indexes
> ===============================
> create table t1(c1 int, c2 int, c3 int, c4 char(10));
> create index idx_t1 on t1(c1);
> create index idx_t2 on t1(c2);
> create index idx_t3 on t1(c3);
>
> insert into t1 values(generate_series(1,5000000), 1, 10, 'aaaaaa');
> update t1 set c2 = 2;

> postgres=# update t1 set c2 = 2;
> UPDATE 5000000
> Time: 46533.530 ms (00:46.534)
>
> With HEAD
> ==========
> postgres=# update t1 set c2 = 2;
> UPDATE 5000000
> Time: 52529.839 ms (00:52.530)
>
> I have dropped and recreated the table after each update in the test.

Good thing that you remembered to drop and recreate the table, since
otherwise bottom-up index deletion would look really good!

Besides, this test case is just ludicrous. I bet that Postgres was
always faster than other RDBMSs here, because Postgres is relatively
unconcerned about making updates like this sustainable.
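
FWIW, the number worth comparing in a test like this isn't raw UPDATE
latency -- it's how much the indexes themselves grow across repeated
updates. Something along these lines shows it (using the t1 table from
your test):

    SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
    FROM pg_stat_user_indexes
    WHERE relname = 't1';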

> I have read your patch and have some decent understanding but
> obviously, you and Victor will have a better idea. I am not sure what
> I wrote in my previous email which made you say so. Anyway, I hope I
> have made my point clear this time.

I don't think that you fully understand the insights behind the patch.
Understanding how the patch works mechanistically is not enough.

This patch is unusual in that you really need to think about emergent
behaviors to understand it. That is certainly a difficult thing to do,
and it's understandable that even an expert might not grok it without
considering it carefully. What annoys me here is that you didn't seem
to seriously consider the *possibility* that something like that
*might* be true, even after I pointed it out several times. If I were
looking at a project that you'd worked on just after it was committed,
and something seemed *obviously* wrong, I know that I would think long
and hard about the possibility that my understanding was faulty in
some subtle though important way. I try to always do this when the
apparent problem is too simple and obvious -- I know it's unlikely
that a respected colleague would make such a basic error (which is not
to say that there cannot still be some error).

--
Peter Geoghegan
