Re: Eager page freeze criteria clarification

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Eager page freeze criteria clarification
Date: 2023-09-23 19:53:33
Message-ID: CAAKRu_a9zUx8JcZkb4UxrZrqcVr_d9f0u8-H=g61kq0YLS1Tbw@mail.gmail.com

On Mon, Aug 28, 2023 at 4:30 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> On Mon, Aug 28, 2023 at 12:26 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > In row D, your algorithms are all bad, really bad. I don't quite
> > understand how it can be that bad, actually.
>
> So, I realize now that this test was poorly designed. I meant it to be a
> worst case scenario, but I think one critical part was wrong. In this
> example one client is going at full speed inserting a row and then
> updating it. Then another rate-limited client is deleting old data
> periodically to keep the table at a constant size. I meant to bulk load
> the table with enough data that the delete job would have data to delete
> from the start. With the default autovacuum settings, over the course of
> 45 minutes, I usually saw around 40 autovacuums of the table. Due to the
> rate limiting, the first autovacuum of the table ends up freezing many
> pages that are deleted soon after. Thus the total number of page freezes
> is very high.
>
> I will redo benchmarking of workload D and start the table with the
> number of rows which the DELETE job seeks to maintain. My back of the
> envelope math says that this will mean ratios closer to a dozen (and not
> 5000).
...
> I'll rerun workload D in a more reasonable way and be back with results.

Hi,

I have edited and rerun all of the benchmarks for a subset of the
algorithms. I ran workloads A-I, except for G (G is not an interesting
workload), on Postgres master as well as Postgres with freeze heuristic
algorithms 4 and 5.

As a refresher, algorithms 4 and 5 are as follows:

Freeze tuples on a page opportunistically if the page would be totally
frozen and:

4. Buffer is already dirty and no FPI is required OR page LSN is older
than 33% of the LSNs since the last vacuum of the table.

5. Buffer is already dirty and no FPI is required AND page LSN is older
than 33% of the LSNs since the last vacuum of the table.

On master, the heuristic is to freeze a page opportunistically if it
would be totally frozen and if pruning emitted an FPI.
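
For clarity, here is a rough Python sketch of the decision being
compared. This is pseudocode for illustration only -- the names are made
up, the real logic is C in lazy_scan_prune(), and the "33% of the LSNs
since the last vacuum" cutoff is taken as a precomputed flag:

def should_opportunistically_freeze(algorithm,
                                    page_would_be_all_frozen,
                                    buffer_already_dirty,
                                    freeze_requires_fpi,
                                    page_lsn_older_than_cutoff):
    # Only pages that would end up all-frozen are candidates at all.
    if not page_would_be_all_frozen:
        return False

    # "Cheap": the buffer is already dirty and freezing needs no new FPI.
    freeze_is_cheap = buffer_already_dirty and not freeze_requires_fpi

    if algorithm == 4:
        # OR variant: freeze if it is cheap or the page looks cold.
        return freeze_is_cheap or page_lsn_older_than_cutoff
    if algorithm == 5:
        # AND variant: freeze only if it is cheap and the page looks cold.
        return freeze_is_cheap and page_lsn_older_than_cutoff
    raise ValueError("only algorithms 4 and 5 are sketched here")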

I made a few configuration changes for all benchmarks -- most notably
autovacuum is on for all workloads and autovacuum_naptime is decreased
to 10 seconds. The runtime of all time-based workloads was changed to 40
minutes. All workloads were run for a fixed duration except for the COPY
workload (workload F).

It is also worth noting that data loaded into the standard pgbench
tables (with pgbench -i) was loaded with COPY and not COPY FREEZE. Also,
pgbench -i and the pgbench runs were passed the --no-vacuum option.

The updated workloads are as follows:

A. Gaussian TPC-B like + select-only:
pgbench scale 450 (DB < SB) with indexes on updated columns
WL 1: 16 clients doing TPC-B like pgbench but with Gaussian access
distribution (with parameter 6) for updated tables
WL 2: 2 clients, rate-limited to 1000 TPS, doing select-only pgbench

B. TPC-B like
pgbench scale 450
16 clients running the built-in pgbench TPC-B like script

C. Shifting hot set:
16 clients inserting a single row then updating an indexed column in
that row

D. Shifting hot set, delete old data
WL 1: 16 clients inserting one row then updating an indexed column in
that row
WL 2: 1 client, rate limited to 10 TPS, deleting data more than 1
minute old

E. Shifting hot set, delete new data, access new data
WL 1: 16 clients cycling through 2 single row inserts, an update of
an indexed column in a single recently inserted row, and a delete of
a single recently inserted row
WL 2: 1 client, rate-limited to 0.2 TPS, selecting data from the last
5 minutes

F. Many COPYs
1 client, copying a total of 90 GB of data in 70 individual COPYs

G. N/A

H. Append only table
16 clients, inserting a single row at a time

I. Work queue
WL 1: 16 clients inserting a single row, updating a non-indexed
column in that row twice, then deleting that row
WL 2: 1 client, rate-limited to 0.02 TPS, COPYing data into another
table

Below are some of the data I collected, including how much WAL was
generated, how much IO various backends did, TPS, and P99 latency.
These were collected at the end of the pgbench run.

Most numbers were rounded to the nearest whole number to make the charts
easier to read. So, for example, a 0 in WAL GB does not mean that no WAL
was emitted. P99 latencies below 1 were kept with decimal precision
rather than rounded to a whole number.

All IO time is in milliseconds.

"P99 lat" was calculated using the "time" field from pgbench's
"per-transaction logging" option [1]. It is the "transaction's elapsed time"
or duration. I converted it to milliseconds.
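
Roughly, the computation was along these lines (a sketch rather than the
exact script; the log path and the nearest-rank percentile method are
illustrative):

import sys

def p99_latency_ms(log_path):
    # pgbench --log lines look like:
    #   client_id transaction_no time script_no time_epoch time_us ...
    # where the third field ("time") is the transaction's elapsed time in
    # microseconds.
    elapsed_ms = []
    with open(log_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                elapsed_ms.append(int(fields[2]) / 1000.0)  # us -> ms
    elapsed_ms.sort()
    # Nearest-rank 99th percentile.
    return elapsed_ms[max(0, int(len(elapsed_ms) * 0.99) - 1)]

if __name__ == "__main__":
    print("P99 latency: %.1f ms" % p99_latency_ms(sys.argv[1]))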

"pages frozen" here refers to the number of pages marked frozen in the
visibility map at the end of the run. "page freezes" refers to the
number of times vacuum froze tuples in lazy_scan_prune(). This does not
include cases in which the page was found to be all frozen but no tuples
were explicitly frozen.
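
As an aside, comparable end-of-run frozen page counts can be read out of
the visibility map with the pg_visibility extension. A minimal sketch,
assuming CREATE EXTENSION pg_visibility, psycopg2, and a hypothetical
connection string (this is not the instrumentation used for the numbers
below):

import psycopg2

def frozen_page_summary(conninfo, table):
    with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
        # all_visible/all_frozen page counts as recorded in the VM.
        cur.execute(
            "SELECT all_visible, all_frozen "
            "FROM pg_visibility_map_summary(%s::regclass)",
            (table,))
        all_visible, all_frozen = cur.fetchone()
        # Total number of main-fork pages in the relation.
        cur.execute(
            "SELECT pg_relation_size(%s::regclass) / "
            "current_setting('block_size')::bigint",
            (table,))
        (total_pages,) = cur.fetchone()
    return all_visible, all_frozen, total_pages

if __name__ == "__main__":
    av, af, total = frozen_page_summary("dbname=bench", "pgbench_accounts")
    pct = 100.0 * af / total if total else 0
    print("all-frozen: %d of %d pages (%.0f%%)" % (af, total, pct))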

"AVs" is the number of autovacuums for the table or tables specified.
"M" (under algo) indicates unpatched Postgres master.

pgbench_accounts is abbreviated as "Acct", pgbench_history as "Hist",
and pgbench_branches + pgbench_tellers as B+T.

Workload A:

+------+--------+---------------------+---------------------+------------------+
| algo | WAL GB | cptr bgwriter writes | other reads/writes | IO time AV worker|
+------+--------+---------------------+---------------------+------------------+
| M | 37 | 3,064,470 | 119,021 | 199 |
| 4 | 41 | 3,071,229 | 119,010 | 199 |
| 5 | 37 | 3,067,462 | 119,043 | 198 |
+------+--------+---------------------+---------------------+------------------+

+------+---------+-----------+----------------+----------------------+
| algo | TPS read | TPS write | P99 latency read | P99 latency write |
+------+---------+-----------+----------------+----------------------+
| M | 1000 | 5,391 | 0.1 | 4.0 |
| 4 | 1000 | 5,391 | 0.1 | 3.9 |
| 5 | 1000 | 5,430 | 0.1 | 3.9 |
+------+---------+-----------+----------------+----------------------+

+------+---------+---------+---------+
| algo | Acct AVs | B+T AVs| Hist AVs|
+------+---------+---------+---------+
| M | 2 | 459 | 21 |
| 4 | 2 | 467 | 20 |
| 5 | 2 | 469 | 21 |
+------+---------+---------+---------+

+------+------------------+------------------+---------------+
| algo | Acct Page Freezes| Acct Pages Frozen| Acct % Frozen |
+------+------------------+--------------------+-------------+
| M | 2,555 | 646 | 0% |
| 4 | 866,141 | 385,039 | 52% |
| 5 | 4 | 0 | 0% |
+------+------------------+--------------------+-------------+

+------+-------------------+-------------------+---------------+
| algo | Hist Page Freezes | Hist Pages Frozen | Hist % Frozen |
+------+-------------------+-------------------+---------------+
| M | 0 | 0 | 0% |
| 4 | 70,227 | 69,835 | 85% |
| 5 | 19,204 | 19,024 | 23% |
+------+-------------------+-------------------+---------------+

+------+-------------------+-------------------+---------------+
| algo | B+T Page Freezes | B+T Pages Frozen | B+T % Frozen |
+------+-------------------+-------------------+---------------+
| M | 152 | 297 | 33% |
| 4 | 74,323 | 809 | 100% |
| 5 | 6,215 | 384 | 48% |
+------+-------------------+-------------------+---------------+

Algorithm 4 leads to a substantial increase in the amount of WAL (due to
additional FPIs). However, the impact on P99 latency and TPS is minimal.
The relatively low number of page freezes on pgbench_accounts is likely
due to the fact that freezing only happens in lazy_scan_prune(), before
index vacuuming. pgbench_accounts' updated column was indexed, so many
tuples could not be removed before freezing, making pages ineligible for
opportunistic freezing.

This does not explain why algorithm 5 froze fewer pages of
pgbench_accounts than even master. I plan to investigate this further. I
also noticed that the overall write TPS seems low for my system. It is
likely due to row-level contention because of the Gaussian access
distribution, but it merits further analysis.

Additionally, I plan to redesign this benchmark. I will run a single
pgbench instance and use pgbench's script weight feature to issue more
write script executions than select-only script executions. Using two
pgbench instances and rate-limiting the select-only script makes it
difficult to tell if there is a potential impact to concurrent read
throughput.

Workload B:

+------+--------+---------------------+-------------------+------------------+
| algo | WAL GB | cptr bgwriter writes| other reads/writes| IO time AV worker|
+------+--------+---------------------+---------------------+----------------+
| M | 73 | 5,972,992 | 647,342 | 75 |
| 4 | 74 | 5,976,167 | 640,682 | 66 |
| 5 | 73 | 5,962,465 | 644,060 | 70 |
+------+--------+---------------------+---------------------+----------------+

+------+--------+-------------+
| algo | TPS | P99 latency |
+------+--------+-------------+
| M | 21,336 | 1.5 |
| 4 | 21,458 | 1.5 |
| 5 | 21,384 | 1.5 |
+------+--------+-------------+

+------+---------+---------+---------+
| algo | Acct AVs| B+T AVs | Hist AVs|
+------+---------+---------+---------+
| M | 1 | 472 | 22 |
| 4 | 1 | 470 | 21 |
| 5 | 1 | 469 | 19 |
+------+---------+---------+---------+

+------+------------------+--------------------+---------------+
| algo | Acct Page Freezes| Acct Pages Frozen | Acct % Frozen |
+------+------------------+--------------------+---------------+
| M | 0 | 0 | 0% |
| 4 | 269,850 | 0 | 0% |
| 5 | 194 | 0 | 0% |
+------+------------------+-------------------+----------------+

+------+-------------------+---------------------+---------------+
| algo | B+T Page Freezes | B+T Pages Frozen | B+T % Frozen |
+------+-------------------+---------------------+---------------+
| M | 0 | 123 | 34% |
| 4 | 34,268 | 148 | 44% |
| 5 | 25 | 114 | 33% |
+------+-------------------+---------------------+---------------+

+------+-------------------+---------------------+---------------+
| algo | Hist Page Freezes | Hist Pages Frozen | Hist % Frozen |
+------+-------------------+---------------------+---------------+
| M | 0 | 0 | 0% |
| 4 | 297,083 | 296,688 | 90% |
| 5 | 85,573 | 85,353 | 26% |
+------+-------------------+---------------------+---------------+

Algorithm 4 is more aggressive and does much more freezing. Since there
are no indexes, more page freezing can happen for updated tables. These
pages, however, are uniformly updated and do not stay frozen, so the
freezing is pointless. This freezing doesn't seem to substantially
impact P99 latency, TPS, or volume of WAL emitted.

Workload C:

+------+--------+---------------------+---------------------+------------------+
| algo | WAL GB | cptr bgwriter writes| other reads/writes | IO time AV worker|
+------+--------+---------------------+---------------------+------------------+
| M | 16 | 910,583 | 61,124 | 75 |
| 4 | 17 | 896,213 | 61,124 | 75 |
| 5 | 15 | 900,068 | 61,124 | 76 |
+------+--------+---------------------+---------------------+------------------+

+------+-------+--------------+
| algo | TPS | P99 latency |
+------+-------+--------------+
| M | 10,653| 2 |
| 4 | 10,620| 2 |
| 5 | 10,591| 2 |
+------+-------+--------------+

+------+-----+--------------+-------------+----------+
| algo | AVs | Page Freezes | Pages Frozen | % Frozen|
+------+-----+--------------+-------------+----------+
| M | 4 | 130,600 | 0 | 0% |
| 4 | 4 | 364,781 | 197,135 | 60% |
| 5 | 4 | 0 | 0 | 0% |
+------+-----+--------------+-------------+----------+

It is notable that algorithm 5 again did not freeze any pages.

Workload D:

+------+--------+---------------------+---------------------+------------------+
| algo | WAL GB | cptr bgwriter writes| other reads/writes | IO time AV worker|
+------+--------+---------------------+---------------------+------------------+
| M | 12 | 50,745 | 481 | 1.02 |
| 4 | 12 | 49,590 | 481 | 0.99 |
| 5 | 12 | 51,044 | 481 | 0.98 |
+------+--------+---------------------+---------------------+------------------+

+------+----------+------------+-------------------+--------------------+
| algo | TPS write | TPS delete| P99 latency write | P99 latency delete |
+------+----------+------------+--------------------+-------------------+
| M | 11,034 | 10 | 1.7 | 107 |
| 4 | 10,992 | 10 | 1.7 | 93 |
| 5 | 11,014 | 10 | 1.7 | 95 |
+------+----------+------------+-------------------+--------------------+

+------+-----+--------------+-------------+-----------+
| algo | AVs | Page Freezes | Pages Frozen | % Frozen |
+------+-----+--------------+-------------+-----------+
| M | 230| 2,066 | 332 | 6% |
| 4 | 233| 631,220 | 1,404 | 25% |
| 5 | 231| 456,066 | 2,352 | 44% |
+------+-----+--------------+--------------+----------+

The P99 delete latency is higher on master. However, master's write TPS
is also higher, so the lower delete latency with algorithms 4 and 5 may
be unrelated to freezing. Notably, algorithm 4 froze least effectively --
it did the most page freezes per page left frozen at the end of the run.

Workload E:

+------+--------+---------------------+---------------------+------------------+
| algo | WAL GB | cptr bgwriter writes | other reads/writes | IO time AV worker|
+------+--------+---------------------+---------------------+------------------+
| M | 15 | 738,854 | 482 | 1.0 |
| 4 | 15 | 745,182 | 482 | 0.8 |
| 5 | 15 | 733,916 | 482 | 0.9 |
+------+--------+---------------------+---------------------+------------------+

+------+---------+-----------+--------------------+------------------+
| algo | TPS read | TPS write| P99 latency read ms| P99 latency write|
+------+---------+-----------+--------------------+------------------+
| M | 0.2 | 20,171 | 1,655 | 2 |
| 4 | 0.2 | 20,230 | 1,611 | 2 |
| 5 | 0.2 | 20,165 | 1,579 | 2 |
+------+---------+-----------+--------------------+------------------+

+------+-----+--------------+--------------+----------+
| algo | AVs | Page Freezes | Pages Frozen | % Frozen |
+------+-----+--------------+--------------+----------+
| M | 23 | 2,396 | 0 | 0% |
| 4 | 23 | 231,816 | 121,373 | 71% |
| 5 | 23 | 68,302 | 36,294 | 21% |
+------+-----+--------------+--------------+----------+

The additional page freezes done by algorithm 4 seem somewhat effective.
They do not appear to impact throughput or latency, nor do they have a
large impact on WAL volume.

Workload F:

+------+--------+---------------------+--------------------+------------------+
| algo | WAL GB | cptr bgwriter writes| other reads/writes | IO time AV worker|
+------+--------+---------------------+---------------------+-----------------+
| M | 173 | 1,202,231 | 53,957,448 | 12,389 |
| 4 | 189 | 1,212,521 | 55,589,140 | 13,084 |
| 5 | 173 | 1,194,242 | 54,260,118 | 13,407 |
+------+--------+---------------------+--------------------+------------------+

+------+--------------+
| algo | P99 latency |
+------+--------------+
| M | 19875 |
| 4 | 19314 |
| 5 | 19701 |
+------+--------------+

+------+-----+--------------+--------------+----------+
| algo | AVs | Page Freezes | Pages Frozen | % Frozen |
+------+-----+--------------+--------------+----------+
| M | 3 | 0 | 0 | 0% |
| 4 | 3 | 2,598,727 | 2,598,725 | 24% |
| 5 | 3 | 475,063 | 475,064 | 4% |
+------+-----+--------------+--------------+----------+

Algorithm 4 produced a notably higher volume of WAL but also managed to
freeze a decent portion of the table. Despite the extra WAL, P99 latency
did not regress -- each COPY did not take measurably longer with
algorithm 4.

Workload H:

+------+--------+----------------------+---------------------+------------------+
| algo | WAL GB | cptr bgwriter writes | other reads/writes | IO time AV worker|
+------+--------+----------------------+---------------------+------------------+
| M | 12 | 922,121 | 318 | 1.03 |
| 4 | 16 | 921,543 | 318 | 0.90 |
| 5 | 12 | 921,309 | 318 | 0.74 |
+------+--------+----------------------+---------------------+------------------+

+------+--------+--------------+
| algo | TPS | P99 latency |
+------+--------+--------------+
| M | 21,926 | 0.98 |
| 4 | 22,028 | 0.97 |
| 5 | 21,919 | 0.99 |
+------+--------+--------------+

+------+-----+--------------+-------------+-----------+
| algo | AVs | Page Freezes | Pages Frozen | % Frozen |
+------+-----+--------------+--------------+----------+
| M | 6 | 0 | 0 | 0% |
| 4 | 6 | 560,317 | 560,198 | 94% |
| 5 | 6 | 11,921 | 11,921 | 2% |
+------+-----+--------------+--------------+----------+

Algorithm 4 produced a much higher volume of WAL and froze much more of
the table. This did not negatively impact throughput or latency.
Algorithm 5 froze dramatically fewer pages than algorithm 4.

Workload I:

+------+--------+----------------------+---------------------+------------------+
| algo | WAL GB | cptr bgwriter writes | other reads/writes | IO time AV worker|
+------+--------+----------------------+---------------------+------------------+
| M | 32 | 123,947 | 32,733,204 | 16,803 |
| 4 | 59 | 124,233 | 32,734,188 | 17,151 |
| 5 | 32 | 124,324 | 32,733,166 | 17,122 |
+------+--------+----------------------+---------------------+------------------+

+------+-----------+---------+------------------+------------------+
| algo | TPS queue | TPS copy| P99 latency queue| P99 latency copy |
+------+-----------+---------+------------------+------------------+
| M | 5,407 | 0.02 | 4 | 8,386 |
| 4 | 5,277 | 0.02 | 5 | 8,520 |
| 5 | 5,401 | 0.02 | 4 | 8,399 |
+------+-----------+---------+------------------+------------------+

+------+-----------+----------+
| algo | Queue AVs | Copy AVs |
+------+-----------+----------+
| M | 236 | 5 |
| 4 | 236 | 5 |
| 5 | 235 | 5 |
+------+----------+-----------+

+------+-------------------+--------------------+---------------+
| algo | Queue Page Freezes| Queue Pages Frozen | Queue % Frozen|
+------+-------------------+--------------------+---------------+
| M | 1 | 1 | 100% |
| 4 | 1 | 1 | 100% |
| 5 | 1 | 1 | 100% |
+------+-------------------+--------------------+---------------+

+------+-------------------+-------------------+---------------+
| algo | Copy Page Freezes | Copy Pages Frozen | Copy % Frozen |
+------+-------------------+-------------------+---------------+
| M | 0 | 0 | 0% |
| 4 | 3,821,658 | 3,821,655 | 64% |
| 5 | 431,044 | 431,043 | 7% |
+------+-------------------+-------------------+---------------+

This is a case in which algorithm 4 seemed to have a measurable negative
performance impact. It emitted twice the WAL and had substantially worse
P99 latency and throughput.

-------

My takeaways from all of the workload results are as follows:

Algorithm 4 is too aggressive and regresses performance compared to
master.

Algorithm 5 freezes surprisingly few pages, especially in workloads
where only the most recent data is being accessed or modified
(append-only, update recent data).

A work queue-like workload with other concurrent workloads is a
particular challenge for the freeze heuristic, and we should think more
about how to handle this.

We should consider freezing again after index vacuuming and unused tuple
reaping.

In many cases a moderate amount of additional freezing (like algorithm
5) does not impact P99 latency and throughput in a meaningful way.

Next steps:
I plan to make some changes to the benchmarks and rerun them with the
new algorithms suggested on the thread.

- Melanie

[1] https://www.postgresql.org/docs/current/pgbench.html
