Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch

From: Shayon Mukherjee <shayonj(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
Date: 2025-05-08 20:58:35
Message-ID: CANqtF-q7x-aw84OtiTZ-VkLuafmutzgr-Lk3LeeiKs+zFm2Mng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for engaging into the conversation and sharing your thoughts Sami
and Robert.

On Thu, May 8, 2025 at 2:57 PM Sami Imseih <samimseih(at)gmail(dot)com> wrote:

> > actually determine you are in this situation, or how bad the situation
> > was, in order to know that setting truncate off would help? To that
>
>
That's definitely a sound idea and the way I was able to replicate and
learn myself was by doing a PATCH that would just print the # of attempts
it was doing. That said, just to add more color - I was coming at this from
a point of view where truncation would be on a "best effort" basis and the
trade off here would be less availability risk at the cost of missed
truncations. My first instinct was that, having vacuum_truncate set to ON
by default on busy systems that have a lot of churn can catch a lot of
developers and PostgreSQL by surprise (sure did to me :D), so the best
effort and "fail fast" behavior could help, however at the same time I
agree that it doesn't solve either of the problems really well. Another
idea I had around this section was to expose an attribute/guc/setting like
VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES with a default of `1` or
something, that users could optionally configure, perhaps that could be a
middleground, but I am myself not fully sold on the idea either (?).

> Adding counters for this area is not a bad idea in general, as this hits
> customers particularly hard on hot standbys when the truncate does actually
> occur on the primary.
>
> What about adding cumulative counters ( per table and pg_stat_database
> ) such as:
>
> pages_vac_truncated - # of pages truncated by vacuum
> vac_truncate_conflicts - # of time truncate was skipped due to conflict
> vac_truncate_suspended - # of times the truncate was suspended.
>
> The difference between conflict and suspended is conflict causes vacuum
> to skip the truncate phase while suspended causes vacuum to retry the
> phase continually.
>

Would vac_truncate_retries or something similar be more immediately clear?
If so, maybe something like

- vac_truncate_interruptions - # of time truncate was skipped due to
conflict and not retried
- vac_truncate_retries - # of times the truncate was suspended and retried

> The argument against adding these counters is that the views,
> pg_stat_all_tables/
> pg_stat_all_database are becoming super-wide, so maybe we need to think
> about inventing a new view for vacuum related counter metrics. This seems
> like
> a good discussion for v19.
>
>
+1 for this. I'd even be happy to even just start showing the attempts as
part of INFO or VERBOSE on conflicts & suspensions, without keeping track,
but it is useful statistical data.

Thanks
Shayon

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Sami Imseih 2025-05-08 20:50:32 Re: queryId constant squashing does not support prepared statements