Re: Eager page freeze criteria clarification

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Eager page freeze criteria clarification
Date: 2023-10-12 00:43:46
Message-ID: 20231012004346.dvdltiwjvsb7ei6v@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

Robert, Melanie and I spent an evening discussing this topic around
pgconf.nyc. Here are, mildly revised, notes from that:

First a few random points that didn't fit with the sketch of an approach
below:

- Are unlogged tables a problem for using LSN based heuristics for freezing?

We concluded, no, not a problem, because aggressively freezing does not
increase overhead meaningfully, as we would already dirty both the heap and VM
page to set the all-visible flag.

- "Unfreezing" pages that were frozen hours / days ago isn't too bad and can
be desirable.

The main thing we are worried about is repeated freezing / unfreezing of
pages within a relatively short time period.

- Computing an average "modification distance" as I (Andres) proposed for
each page is complicated / "fuzzy"

The main problem is that it's not clear how to come up with a good number
for workloads that have many more inserts into new pages than modifications
of existing pages.

It's also hard to use an average for this kind of thing, e.g. in cases where
new pages are frequently updated, but also some old data is updated, it's
easy for the updates to the old data to completely skew the average, even
though that shouldn't prevent us from freezing.

- We also discussed an idea by Robert to track the number of times we need to
dirty a page when unfreezing and to compare that to the number of pages
dirtied overall (IIRC), but I don't think we really came to a conclusion
around that - and I didn't write down anything so this is purely from
memory.
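
The earlier point about averages being skewed by updates to old data can be
illustrated with a few made-up numbers. This is only a sketch: the distances
and the 8:2 mix are assumptions for illustration, not measurements from the
discussion.

```python
from statistics import mean, median

# Hypothetical "modification distances": how far the insert LSN advanced
# (in bytes of WAL) between a page's previous change and its current one.
recent_page_updates = [1_000_000] * 8      # 8 updates to fresh pages, ~1MB apart
old_page_updates = [50_000_000_000] * 2    # 2 updates to data untouched for ~50GB of WAL

distances = recent_page_updates + old_page_updates

avg = mean(distances)
med = median(distances)
print(f"mean   = {avg:,.0f}")   # dominated by the two old-page updates
print(f"median = {med:,.0f}")   # reflects the typical (recent page) update
```

If the average were used as the age a page must reach before being frozen
(one possible use, assumed here), the two updates to old data would push that
threshold to roughly 10GB of WAL, although 80% of the modifications happen
within about 1MB.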

A rough sketch of a freezing heuristic:

- We concluded that to intelligently control opportunistic freezing we need
statistics about the number of freezes and unfreezes

- We should track page freezes / unfreezes in shared memory stats on a
per-relation basis

- To use such statistics to control heuristics, we need to turn them into
rates. For that we need to keep snapshots of absolute values at certain
times (when vacuuming), allowing us to compute a rate.

- If we snapshot some stats, we need to limit the amount of space that data occupies

- evict based on wall clock time (we don't care about unfreezing pages
frozen a month ago)

- "thin out" data when exceeding limited amount of stats per relation
using random sampling or such

- need a smarter approach than just keeping N last vacuums, as there are
situations where a table is (auto-) vacuumed at a high frequency

- only looking at recent-ish table stats is fine, because
- a) we don't want to look at too old data, as we need to deal with changing
workloads

- b) if there aren't recent vacuums, falsely freezing is of bounded cost

- shared memory stats being lost on crash-restart/failover might be a problem

- we certainly don't want to immediately store these stats in a table, due
to the xid consumption that'd imply

- Attributing "unfreezes" to specific vacuums would be powerful:

- "Number of pages frozen during vacuum" and "Number of pages unfrozen that
were frozen during the same vacuum" provides numerator / denominator for
an "error rate"

- We can perform this attribution by comparing the page LSN with recorded
start/end LSNs of recent vacuums

- If the freezing error rate of recent vacuums is low, freeze more
aggressively. This is important to deal with insert-mostly workloads.

- If old data is "unfrozen", that's fine, we can ignore such unfreezes when
controlling "freezing aggressiveness"

- Ignoring unfreezing of old pages is important to e.g. deal with
workloads that delete old data

- This approach could provide "goals" for opportunistic freezing in a
somewhat understandable way. E.g. aiming to rarely unfreeze data that has
been frozen within 1h/1d/...
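
To make the sketch above a bit more concrete, here is a toy model of the
bookkeeping: per-vacuum snapshots, wall clock eviction, random thinning,
LSN-based attribution of unfreezes, and the resulting error rate. All names
(VacuumRecord, evict_old, thin_out, attribute_unfreeze, error_rate) and all
numbers are hypothetical; this is not existing PostgreSQL code, and it assumes
the page LSN at unfreeze time is still the one set by the freezing vacuum.

```python
import random
from dataclasses import dataclass


@dataclass
class VacuumRecord:
    """Hypothetical per-vacuum stats snapshot; every field name is illustrative."""
    start_lsn: int
    end_lsn: int
    end_time: float          # wall clock time (seconds) at which the vacuum ended
    pages_frozen: int = 0    # pages this vacuum froze
    pages_unfrozen: int = 0  # later unfreezes attributed to this vacuum


def evict_old(records, now, max_age):
    """Wall clock eviction: forget vacuums that ended more than max_age seconds ago."""
    return [r for r in records if now - r.end_time <= max_age]


def thin_out(records, max_records, rng):
    """Keep a random sample once a relation exceeds its snapshot budget."""
    if len(records) <= max_records:
        return records
    return sorted(rng.sample(records, max_records), key=lambda r: r.end_lsn)


def attribute_unfreeze(records, page_lsn):
    """Charge an unfreeze to the vacuum whose LSN range contains page_lsn.

    Returns False when no remembered vacuum matches, i.e. the page was frozen
    long enough ago that the unfreeze is ignored.
    """
    for r in records:
        if r.start_lsn <= page_lsn <= r.end_lsn:
            r.pages_unfrozen += 1
            return True
    return False


def error_rate(records):
    """Unfrozen / frozen over the remembered vacuums, or None without stats."""
    frozen = sum(r.pages_frozen for r in records)
    if frozen == 0:
        return None
    return sum(r.pages_unfrozen for r in records) / frozen


records = [
    VacuumRecord(start_lsn=1000, end_lsn=2000, end_time=100.0, pages_frozen=200),
    VacuumRecord(start_lsn=5000, end_lsn=6000, end_time=4000.0, pages_frozen=100),
]
# Goal "rarely unfreeze data frozen within 1h": only keep the last hour of vacuums.
records = evict_old(records, now=5000.0, max_age=3600.0)

counted = attribute_unfreeze(records, page_lsn=5500)  # frozen by the recent vacuum
ignored = attribute_unfreeze(records, page_lsn=1500)  # frozen by the evicted vacuum
print(counted, ignored, error_rate(records))
```

A low error rate would argue for freezing more aggressively. What to do when
error_rate() returns None, i.e. when there are no stats, is left open here, as
it was in the discussion.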

Around this point my laptop unfortunately ran out of battery. Possibly the
attendees of this mini summit also ran out of steam (and tea).

We had a few "disagreements" or "unresolved issues":

- How aggressive should we be when we have no stats?

- Should the freezing heuristic take into account whether freezing would
require an FPI? Or whether the page was not in s_b, or ...

I likely mangled this substantially, both when taking notes during the lively
discussion, and when revising them to make them a bit more readable.

Greetings,

Andres Freund
