From: Joe Conway <mail(at)joeconway(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: Eager page freeze criteria clarification
Date: 2023-12-09 14:24:01
Message-ID: 530b4ab3-8a18-4378-bc0c-c55449c7b19a@joeconway.com
Lists: pgsql-hackers

On 12/8/23 23:11, Melanie Plageman wrote:
> On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
>> The next step is to devise different heuristics and measure their
> efficacy. IMO, the goal of the algorithm is to freeze pages in a
>> relation such that we drive early unfreezes/freezes -> 0 and pages
>> frozen/number of pages of a certain age -> 1.
>
> Attached is a patchset with an adaptive freezing algorithm that works
> well. It tracks how long pages remain unmodified after having been
> set all-visible and uses that to predict how likely a page is to be
> modified given its age.
>
> Each time an all-visible page is modified by an update, delete, or tuple
> lock, if the page was all-visible for less than target_freeze_duration
> (a new guc that specifies the minimum amount of time you would like data
> to stay frozen), it is counted as an "early unset" and the duration that
> it was unmodified is added into an accumulator. Before each relation is
> vacuumed, we calculate the mean and standard deviation of these
> durations using the accumulator. This is used to calculate a page LSN
> threshold demarcating pages with a < 5% likelihood of being modified
> before target_freeze_duration. We don't freeze pages younger than this
> threshold.
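>
> To make the cutoff concrete: for a normal distribution, only ~5% of
> values exceed mean + 1.645 * stddev, so a page that has already been
> unmodified for longer than that is unlikely to be an early unset. A
> minimal sketch of the threshold calculation (names are illustrative,
> not from the patch; ages are assumed to be measured in LSN units, and
> PG's XLogRecPtr and InvalidXLogRecPtr are assumed):
>
> static XLogRecPtr
> freeze_threshold_lsn(double mean, double stddev, XLogRecPtr cur_lsn)
> {
>     /* 95th percentile of the fitted normal distribution */
>     double  cutoff_age = mean + 1.645 * stddev;
>
>     if (cutoff_age >= (double) cur_lsn)
>         return InvalidXLogRecPtr;   /* nothing is old enough yet */
>
>     /* pages last modified before this LSN may be frozen */
>     return cur_lsn - (XLogRecPtr) cutoff_age;
> }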
>
> I've done away with the ring buffer of vacuums and the idea of
> attributing an "unset" to the vacuum that set it all visible. Instead,
> using an "accumulator", I keep a running sum of the page ages, along
> with the cardinality of the accumulated set and the sum squared of page
> ages. This data structure allows us to extract the mean and standard
> deviation, at any time, from an arbitrary number of values in constant
> space; and is used to model the pattern of these unsets as a normal
> distribution that we can use to try to predict whether or not a page
> will be modified.
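>
> A minimal sketch of such an accumulator (again, names here are mine,
> not the patch's):
>
> #include <math.h>
>
> typedef struct
> {
>     double  n;      /* cardinality of the accumulated set */
>     double  sum;    /* running sum of page ages */
>     double  sum_sq; /* running sum of squared page ages */
> } UnsetAccumulator;
>
> static void
> accum_add(UnsetAccumulator *acc, double age)
> {
>     acc->n += 1;
>     acc->sum += age;
>     acc->sum_sq += age * age;
> }
>
> static double
> accum_mean(const UnsetAccumulator *acc)
> {
>     return acc->sum / acc->n;
> }
>
> static double
> accum_stddev(const UnsetAccumulator *acc)
> {
>     double  mean = accum_mean(acc);
>
>     /* Var(X) = E[X^2] - E[X]^2 */
>     return sqrt(acc->sum_sq / acc->n - mean * mean);
> }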
>
> Values can be "removed" from the accumulator by simply decrementing its
> cardinality and decreasing the sum and sum squared by a value that will
> not change the mean and standard deviation of the overall distribution.
> To adapt to a table's changing access patterns, we'll need to remove
> values from this accumulator over time, but this patch doesn't yet
> decide when to do this. A simple solution may be to cap the cardinality
> of the accumulator to the greater of 1% of the table size, or some fixed
> number of values (perhaps 200?). Even without such removal of values,
> the distribution recorded in the accumulator will eventually skew toward
> more recent data, albeit at a slower rate.
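>
> Concretely, scaling both sums by (n - 1) / n removes one value while
> leaving sum/n and sum_sq/n, and therefore the mean and standard
> deviation, unchanged:
>
> static void
> accum_remove_one(UnsetAccumulator *acc)
> {
>     double  scale = (acc->n - 1) / acc->n;
>
>     acc->sum *= scale;
>     acc->sum_sq *= scale;
>     acc->n -= 1;
> }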
>
> This algorithm is able to predict when pages will be modified before
> target_freeze_duration as long as their modification pattern fits a
> normal distribution -- this accommodates access patterns where, after
> some period of time, pages become less likely to be modified the older
> they are. As an example, however, access patterns where modification
> times are bimodal aren't handled well by this model (imagine that half
> your modifications are to very new data and the other half are to data
> that is much older but still younger than target_freeze_duration). If
> this is a common access pattern, the normal distribution could easily be
> swapped out for another distribution. The current accumulator is capable
> of tracking a distribution's first two moments of central tendency (the
> mean and standard deviation). We could track more if we wanted to use a
> fancier distribution.
>
> We also must consider what to do when we have few unsets, e.g. with an
> insert-only workload. When there are very few unsets (I chose 30, the
> commonly cited rule-of-thumb minimum sample size for the central limit
> theorem to hold), we can choose to always freeze freezable
> pages; above this limit, the calculated page threshold is used. However,
> we may not want to make predictions based on 31 values either. To avoid
> this "cliff", we could modify the algorithm a bit to skew the mean and
> standard deviation of the distribution using a confidence interval based
> on the number of values we've collected.
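>
> Putting the pieces together, the per-page decision might look like
> this (a sketch building on the illustrative helpers above):
>
> #define MIN_UNSETS_FOR_MODEL 30
>
> static bool
> should_freeze_page(const UnsetAccumulator *acc,
>                    XLogRecPtr page_lsn, XLogRecPtr cur_lsn)
> {
>     XLogRecPtr  threshold;
>
>     /* Too few unsets to model: freeze all freezable pages. */
>     if (acc->n < MIN_UNSETS_FOR_MODEL)
>         return true;
>
>     threshold = freeze_threshold_lsn(accum_mean(acc),
>                                      accum_stddev(acc), cur_lsn);
>     return page_lsn < threshold;
> }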
>
> The goal is to keep pages frozen for at least target_freeze_duration.
> target_freeze_duration is in seconds and pages only have a last
> modification LSN, so target_freeze_duration must be converted to LSNs.
> To accomplish this, I've added an LSNTimeline data structure, containing
> XLogRecPtr, TimestampTz pairs stored with decreasing precision as they
> age. When we need to translate the guc value to LSNs, we linearly
> interpolate it on this timeline. For the time being, the global
> LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There
> is no reason it can't be updated with some other cadence and/or by some
> other process (nothing about it is inherently tied to vacuum). The
> cached translated value of target_freeze_duration is stored in each
> table's stats. That location is somewhat arbitrary, as it is not
> really a table-level stat.
> However, it needs to be located somewhere that is accessible on
> update/delete. We may want to recalculate it more often than once per
> table vacuum, especially in case of long-running vacuums.
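>
> For illustration, given two timeline entries bracketing the target
> time (now minus target_freeze_duration), the interpolation might look
> like this (sketch only; entries are assumed to be distinct and in
> increasing time order):
>
> typedef struct
> {
>     XLogRecPtr  lsn;
>     TimestampTz time;
> } LSNTimePair;
>
> static XLogRecPtr
> interpolate_lsn(const LSNTimePair *older, const LSNTimePair *newer,
>                 TimestampTz target_time)
> {
>     /* fraction of the way from older to newer, in time */
>     double  frac = (double) (target_time - older->time) /
>                    (double) (newer->time - older->time);
>
>     return older->lsn +
>            (XLogRecPtr) (frac * (double) (newer->lsn - older->lsn));
> }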
>
> To benchmark this new heuristic (I'm calling it algo 6 since it is the
> 6th I've proposed on this thread), I've used a modified subset of my
> original workloads:
>
> Workloads
>
> C. Shifting hot set
> 32 clients inserting multiple rows and then updating an indexed
> column of a row on a different page containing data they formerly
> inserted. Only recent data is updated.
>
> H. Append only table
> 32 clients, each inserting a single row at a time
>
> I. Work queue
> 32 clients, each inserting a row, then updating an indexed column in
> 2 different rows in nearby pages, then deleting 1 of the updated rows
>
>
> Workload C:
>
> Algo | Table | AVs | Page Freezes | Pages Frozen | % Frozen
> -----|----------|-----|--------------|--------------|---------
> 6 | hot_tail | 14 | 2,111,824 | 1,643,217 | 53.4%
> M | hot_tail | 16 | 241,605 | 3,921 | 0.1%
>
>
> Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time | TPS
> -----|--------|------------------|------------|------------|--------
> 6 | 193 | 5,473,949 | 12,793,574 | 14,870 | 28,397
> M | 207 | 5,213,763 | 20,362,315 | 46,190 | 28,461
>
> Algorithm 6 freezes all of the cold data and doesn't freeze the current
> working set. The notable thing is how much this reduces overall system
> I/O. On master, autovacuum is doing more than 3x the I/O and the rest of
> the system is doing more than 1.5x the I/O. I suspect freezing data when
> it is initially vacuumed is saving future vacuums from having to evict
> pages of the working set and read in cold data.
>
>
> Workload H:
>
> Algo | Table | AVs | Page Freezes | Pages Frozen | % frozen
> -----|-----------|-----|--------------|--------------|---------
> 6 | hthistory | 22 | 668,448 | 668,003 | 87%
> M | hthistory | 22 | 0 | 0 | 0%
>
>
> Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time | TPS
> -----|--------|------------------|-----------|------------|--------
> 6 | 14 | 725,103 | 725,575 | 1 | 43,535
> M | 13 | 693,945 | 694,417 | 1 | 43,522
>
> The insert-only table is mostly frozen at the end. There is more I/O
> done but not at the expense of TPS. This is exactly what we want.
>
>
> Workload I:
>
> Algo | Table | AVs | Page Freezes | Pages Frozen | % Frozen
> -----|-----------|-----|--------------|--------------|---------
> 6 | workqueue | 234 | 0 | 4,416 | 78%
> M | workqueue | 234 | 0 | 4,799 | 87%
>
>
> Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO Time | TPS
> -----|--------|------------------|-----------|------------|--------
> 6 | 74 | 64,345 | 64,813 | 1 | 36,366
> M | 73 | 63,468 | 63,936 | 1 | 36,145
>
> What we want is for the work queue table to freeze as little as
> possible, because we will be constantly modifying the data. Both on
> master and with algorithm 6 we do not freeze tuples on any pages. You
> will notice, however, that much of the table is set frozen in the VM at
> the end. This is because we set pages all frozen in the VM if they are
> technically all frozen even if we do not freeze tuples on the page. This
> is inexpensive and not under the control of the freeze heuristic.
>
> Overall, the behavior of this new adaptive freezing method seems to be
> exactly what we want. The next step is to decide how many values to
> remove from the accumulator and benchmark cases where old data is
> deleted.
>
> I'd be delighted to receive any feedback, ideas, questions, or review.

This is well thought out, well described, and a fantastic improvement in
my view -- well done!

I do think we will need to consider distributions other than normal, but
I don't know offhand what they will be.

However, even if we assume a more-or-less normal distribution, we should
consider using subgroups in a way similar to Statistical Process
Control[1]. The reasoning is explained in this quote:

The Math Behind Subgroup Size

The Central Limit Theorem (CLT) plays a pivotal role here. According
to CLT, as the subgroup size (n) increases, the distribution of the
sample means will approximate a normal distribution, regardless of
the shape of the population distribution. Therefore, as your
subgroup size increases, your control chart limits will narrow,
making the chart more sensitive to special cause variation and more
prone to false alarms.

[1]
https://www.qualitygurus.com/rational-subgrouping-in-statistical-process-control/
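
For illustration, instead of feeding individual unset ages into the
accumulator, vacuum could buffer them into fixed-size subgroups and
accumulate the subgroup means, which the CLT makes approximately
normal regardless of the shape of the underlying distribution (sketch
only; the names and subgroup size are mine, and the accumulator is the
one sketched in the quoted text above):

#define SUBGROUP_SIZE 8

static double   subgroup_buf[SUBGROUP_SIZE];
static int      subgroup_len = 0;

static void
record_unset_age(UnsetAccumulator *acc, double age)
{
    subgroup_buf[subgroup_len++] = age;
    if (subgroup_len == SUBGROUP_SIZE)
    {
        double  sum = 0;

        for (int i = 0; i < SUBGROUP_SIZE; i++)
            sum += subgroup_buf[i];

        /* accumulate the subgroup mean, not the raw values */
        accum_add(acc, sum / SUBGROUP_SIZE);
        subgroup_len = 0;
    }
}

Note that the standard deviation of subgroup means shrinks by a factor
of sqrt(SUBGROUP_SIZE), which is the narrowing of control limits the
quote describes, so the cutoff calculation would need to account for
that.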

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
