| From: | "Greg Burd" <greg(at)burd(dot)me> |
|---|---|
| To: | "David Rowley" <dgrowleyml(at)gmail(dot)com> |
| Cc: | "Nathan Bossart" <nathandbossart(at)gmail(dot)com>, "Sami Imseih" <samimseih(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Robert Treat" <rob(at)xzilla(dot)net>, "Jeremy Schneider" <schneider(at)ardentperf(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: another autovacuum scheduling thread |
| Date: | 2026-03-19 13:49:34 |
| Message-ID: | 3ca1e398-c787-47e9-9afc-8e298b94dac0@app.fastmail.com |
| Lists: | pgsql-hackers |
On Thu, Mar 19, 2026, at 7:44 AM, David Rowley wrote:
> On Thu, 19 Mar 2026 at 22:57, Greg Burd <greg(at)burd(dot)me> wrote:
>> I'm late in the review process. I know David Rowley proposed the unified scoring approach that became the foundation of this patch, and I think that's a great direction. However, I'm concerned that the patch's default scoring weights don't give XID-age urgency sufficient priority over dead-tuple urgency. The weight GUCs (autovacuum_vacuum_score_weight, etc.) can address this, but they max at 1.0, meaning you can only reduce dead-tuple priority, not increase XID priority.
>
Hello David,
> I think that it would be good if you could state *why* you disagree
> with the proposed scoring rather than *that* you disagree. All this
> stuff was talked about around [1]. For me, I don't see what's
> particularly alarming about a table reaching
> autovacuum_freeze_max_age. That GUC is set to less than 10% of the
> total transaction ID space of where the table must be frozen. Why is
> it you think these should take priority over everything else? SLRU
> buffers are configurable since v17, so having to lookup the clog for a
> wider range of xids isn't as big an issue as it used to be, plus
> memory and L3 sizes are bigger than they used to be. Is slow clog
> lookups what you're concerned about? You didn't really say.
Fair point. Let me be more specific.
My concern isn't that wraparound vacuums are inherently alarming; I agree with you that reaching freeze_max_age isn't a crisis. The issue is a scoring-scale problem in the gap between freeze_max_age (200M) and failsafe age (1.6B).
In that 1.4B XID window, force_vacuum tables have XID scores of 1.0–8.0 (age/freeze_max_age), while typical active tables accumulate dead-tuple scores of 18–70+ within hours of their last vacuum. The exponential boost doesn't activate until failsafe age, so force_vacuum tables are systematically outranked by routine bloat cleanup for what could be days or weeks in production.
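To make the scale mismatch concrete, here is a minimal sketch of the arithmetic. It is not code from the patch: the function and constant names are made up for illustration, and it assumes the stock defaults of 200M for autovacuum_freeze_max_age and 1.6B for vacuum_failsafe_age.

```python
# Illustrative sketch only; names are hypothetical, not taken from the patch.
FREEZE_MAX_AGE = 200_000_000      # assumed autovacuum_freeze_max_age default
FAILSAFE_AGE = 1_600_000_000      # assumed vacuum_failsafe_age default


def xid_score(xid_age: int, freeze_max_age: int = FREEZE_MAX_AGE) -> float:
    """Linear XID score as described above: age / freeze_max_age."""
    return xid_age / freeze_max_age


low = xid_score(FREEZE_MAX_AGE)   # table has just become force_vacuum
high = xid_score(FAILSAFE_AGE)    # table is about to reach the failsafe age
print(low, high)

# Lower end of the dead-tuple scores quoted above for active tables.
typical_dead_tuple_score = 18.0
print(high < typical_dead_tuple_score)
```

Even at the very top of the window, the linear XID score stays below the low end of the dead-tuple scores I'm seeing for active tables.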
I tried to model this in a stress test with 100 tables competing for 3 workers over 7 days. The v12 score-based scheduler actually performed worse than OID order for wraparound exposure:
Algorithm   Avg exposure      Peak concurrent risk
─────────────────────────────────────────────────
OID         7194 ± 816 min    82 tables
Score       7892 ± 0 min      20 tables
Tiered         4 ± 0 min       5 tables
The score-based scheduler reduced peak concurrent risk (20 vs 82), which is good, but average per-table exposure increased 10% because force_vacuum tables were starved. In 15 out of 20 runs with randomized OIDs, the score scheduler performed worse than OID order.
> Having said that, I'd not realised that Nathan capped the new GUCs at
> 1.0. I think we should allow those to be set higher, likely at least
> to 10.0.
That would definitely help. If autovacuum_freeze_score_weight could be set to 8.0–10.0, DBAs could manually restore the priority we want.
> Maybe we could consider adjusting the code that's setting the
> xid_score/mxid_score so that we start scaling the score aggressively
> when if (xid_age >= effective_xid_failsafe_age /
> Max(autovacuum_freeze_score_weight,1.0)) becomes true
This is clever: it would make the aggressive scaling kick in earlier when the weight is higher. At weight=8.0, you'd get the exponential boost starting at 200M (failsafe/8) instead of 1.6B.
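A quick sketch of how I read that threshold, assuming an effective failsafe age of 1.6B. The function name is hypothetical and this only mirrors the expression quoted above; it is not the patch's code.

```python
# Illustrative sketch only; boost_threshold is a made-up name.
FAILSAFE_AGE = 1_600_000_000  # assumed effective_xid_failsafe_age


def boost_threshold(failsafe_age: int, weight: float) -> float:
    """XID age at which aggressive scaling would start:
    failsafe_age / Max(weight, 1.0), so weights below 1.0 never delay it."""
    return failsafe_age / max(weight, 1.0)


for weight in (0.5, 1.0, 2.0, 8.0):
    print(f"weight={weight}: boost starts at {boost_threshold(FAILSAFE_AGE, weight):,.0f} XIDs")
```

The Max(..., 1.0) clamp means a weight below 1.0 leaves the threshold at the failsafe age rather than pushing it past it.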
Both of these approaches would work. The tiered-sorting proposal was motivated by simplicity. The code already treats wraparound as categorically different (force_vacuum bypasses av_enabled, triggers emergency behavior, can't be disabled per-table). Making the sort order reflect that same categorical distinction felt more aligned with the existing logic than trying to tune scoring weights to create the same effect.
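For comparison, the tiered idea boils down to a two-part sort key. This is a minimal sketch with invented table names and scores, not the patch or the attached simulation: force_vacuum tables sort first, and the score only breaks ties within each tier.

```python
# Illustrative sketch only; Candidate and both helpers are hypothetical.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    score: float        # unified score (dead-tuple or XID driven)
    force_vacuum: bool  # True once the table is past freeze_max_age


def score_order(tables):
    """Score-only ordering: highest score first."""
    return sorted(tables, key=lambda t: t.score, reverse=True)


def tiered_order(tables):
    """Tiered ordering: force_vacuum tables first, then by score."""
    return sorted(tables, key=lambda t: (t.force_vacuum, t.score), reverse=True)


tables = [
    Candidate("active_a", 40.0, False),
    Candidate("aging_b", 1.2, True),
    Candidate("active_c", 18.0, False),
    Candidate("critical_d", 7.5, True),
]

print([t.name for t in score_order(tables)])
print([t.name for t in tiered_order(tables)])
```

With score-only ordering the two force_vacuum tables land behind both active tables; with the tiered key they go first regardless of how large the bloat scores get.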
But I'm not religious about it. I don't have a strong intuition for which would be easier for DBAs to grok and use, or for us to maintain, so if you do, I'll follow your lead. If raising the GUC caps to 10.0 and adding the / Max(weight, 1.0) scaling factor achieves the same goal with less conceptual change, that works for me. The key issue is ensuring force_vacuum tables don't get starved by high-scoring bloat work, and either approach solves that.
> Then, if people
> want to play it safer, then they can set
> autovacuum_freeze_score_weight = 2.0 and have the aggressive scaling
> kick in at 800 million, or whatever half of effective_xid_failsafe_age
> is set to. You could set yours to 8.0, if you really want tables over
> autovacuum_freeze_max_age to take priority over everything else. I
> just don't see or understand the reason why you'd want to.
>
> It's a fairly common misconception that a wraparound vacuum is
> something to be alarmed about. Maybe you've fallen for that?
Not alarmed, but I think the system should process them promptly once they're flagged as force_vacuum, rather than letting them queue behind routine work for potentially days. My simulation suggests the current default weights don't achieve that.
> I recall
> a few proposals to adjust the wording that's shown in pg_stat_activity
> to make them seem less alarming.
>
> David
>
> [1]
> https://www.postgresql.org/message-id/CAApHDvqobtKMwJbhKB_c%3D3-TM%3DTgS3bcuvzcWMm3ee1c0mz9hw%40mail.gmail.com
Attached is autovacuum_simulation_v3.py, which implements your suggestions as a fourth mode called 'dynamic'. It has:
- raised GUC caps: the dynamic mode uses weight=8.0
- exponential scaling at: dynamic_xid_threshold = c.effective_xid_failsafe / max(weight, 1.0)
With weight=8.0, this means the v12 exponential boost starts at 1.6B XIDs, while the dynamic mode's boost starts at 200M XIDs (1.6B / 8.0). Did I capture your suggestion accurately?
I ran the simulation comparing your dynamic scaling approach (weight=8.0, exponential boost at 200M) against tiered sorting. The good news: dynamic scaling is a massive improvement over v12, with average wraparound exposure dropping from 7892 minutes to about 38 minutes.
The difference between dynamic and tiered shows up in tables that cross freeze_max_age during the simulation: with score-only sorting, they still compete with high-scoring active tables until their XID ages grow large enough, resulting in wraparound exposures of 6-102 minutes for the aging tables in the per-table results below.
Tiered sorting processes them immediately upon crossing the threshold, keeping exposure at 3-5 minutes regardless of when they cross.
Both approaches solve the v12 problem and IMO all three are an improvement over what we ship today. I think some form of this patch should make it into v19.
best.
-greg
$ ./autovacuum_simulation_v3.py
================================================================================
FOUR-WAY AUTOVACUUM SCHEDULING COMPARISON
OID vs v12 Score vs Dynamic Scaling (weight=8.0) vs Tiered Sort
================================================================================
Config: 3 workers, 7-day sim, 60s steps, 20 runs
Tables: 5 critical + 15 aging + 80 active = 100
freeze_max_age = 200,000,000
Estimated runtime: 3-8 minutes
Run OID avg Score avg Dynamic avg Tiered avg
--------------------------------------------------------
1 7222m 7892m 38m 4m
2 7642m 7892m 38m 4m
3 7961m 7892m 38m 4m
4 6333m 7892m 38m 4m
5 8110m 7892m 38m 4m
6 6359m 7892m 38m 4m
7 6629m 7892m 38m 4m
8 8526m 7892m 38m 4m
9 6385m 7892m 38m 4m
10 6813m 7892m 38m 4m
11 8588m 7892m 38m 4m
12 7261m 7892m 38m 4m
13 6682m 7892m 38m 4m
14 8035m 7892m 38m 4m
15 5667m 7892m 38m 4m
16 7595m 7892m 38m 4m
17 6394m 7892m 38m 4m
18 6686m 7892m 38m 4m
19 7819m 7892m 38m 4m
20 7178m 7892m 38m 4m
========================================================================
AGGREGATE RESULTS
========================================================================
Avg exposure per run (minutes):
OID : 7194 ± 816 (min=5667, max=8588)
Score : 7892 ± 0 (min=7892, max=7892)
Dynamic : 38 ± 0 (min=38, max=38)
Tiered : 4 ± 0 (min=4, max=4)
Peak concurrent force_vacuum tables:
OID : 82 ± 3 (min=79, max=88)
Score : 20 ± 0 (min=20, max=20)
Dynamic : 5 ± 0 (min=5, max=5)
Tiered : 5 ± 0 (min=5, max=5)
Pairwise wins (lower avg exposure = better):
Score beats OID: 5/20 loses: 15/20 ties: 0/20
Dynamic beats OID: 20/20 loses: 0/20 ties: 0/20
Tiered beats OID: 20/20 loses: 0/20 ties: 0/20
Dynamic beats Score: 20/20 loses: 0/20 ties: 0/20
Tiered beats Score: 20/20 loses: 0/20 ties: 0/20
Tiered beats Dynamic: 20/20 loses: 0/20 ties: 0/20
Variance (std dev of avg exposure across runs):
OID : 816 min
Score : 0 min
Dynamic : 0 min
Tiered : 0 min
Per-table mean exposure (minutes):
Table OID Score Dynamic Tiered
------------------------------------------------------------------------------
critical_0 7564±4470 10080±0 7±0 7±0
critical_1 8577±3671 10080±0 7±0 7±0
critical_2 9073±3099 8938±0 3±0 3±0
critical_3 8581±3661 4±0 4±0 4±0
critical_4 9167±2826 3±0 3±0 3±0
aging_0 7253±4256 9648±0 95±0 4±0
aging_1 7324±3675 9114±0 24±0 4±0
aging_2 6865±3517 8579±0 86±0 5±0
aging_3 7253±3714 8851±0 18±0 4±0
aging_4 6018±4014 8579±0 47±0 4±0
aging_5 6856±2951 8064±0 90±0 4±0
aging_6 6799±3482 8496±0 45±0 3±0
aging_7 5765±4318 8853±0 21±0 4±0
aging_8 7091±3227 8644±0 6±0 4±0
aging_9 6358±2738 7479±0 102±0 4±0
aging_10 7106±3620 8870±0 36±0 4±0
aging_11 7175±3673 8965±0 65±0 4±0
aging_12 6499±3299 8106±0 51±0 4±0
aging_13 6080±3108 7595±0 28±0 4±0
aging_14 6479±3913 8891±0 27±0 4±0
========================================================================
Completed in 149 seconds (2.5 minutes)
========================================================================
Generating visualization...
✓ output/four_way_comparison.png
Done.
| Attachment | Content-Type | Size |
|---|---|---|
| autovacuum_simulation_v3.py | text/x-python | 27.2 KB |
| wraparound_risk_comparison.png | image/png | 377.4 KB |