Re: Should vacuum process config file reload more often

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: Should vacuum process config file reload more often
Date: 2023-04-06 06:39:43
Message-ID: CAD21AoAbsGrFwr9Hn_4ytMUf4u1SBXG7Z2BDCx_ERrBfjEc0eg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 6, 2023 at 12:29 AM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> Thanks all for the reviews.
>
> v16 attached. I put it together rather quickly, so there might be a few
> spurious whitespaces or similar. There is one rather annoying pgindent
> outlier that I have to figure out what to do about as well.
>
> The remaining functional TODOs that I know of are:
>
> - Resolve what to do about names of GUC and vacuum variables for cost
> limit and cost delay (since it may affect extensions)
>
> - Figure out what to do about the logging message which accesses dboid
> and tableoid (lock/no lock, where to put it, etc)
>
> - I see several places in docs which reference the balancing algorithm
> for autovac workers. I did not read them in great detail, but we may
> want to review them to see if any require updates.
>
> - Consider whether or not the initial two commits should just be
> squashed with the third commit
>
> - Anything else reviewers are still unhappy with
>
>
> On Wed, Apr 5, 2023 at 1:56 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, Apr 5, 2023 at 5:05 AM Melanie Plageman
> > <melanieplageman(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Apr 4, 2023 at 4:27 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > ---
> > > > -        if (worker->wi_proc != NULL)
> > > > -            elog(DEBUG2, "autovac_balance_cost(pid=%d db=%u, rel=%u, dobalance=%s cost_limit=%d, cost_limit_base=%d, cost_delay=%g)",
> > > > -                 worker->wi_proc->pid, worker->wi_dboid, worker->wi_tableoid,
> > > > -                 worker->wi_dobalance ? "yes" : "no",
> > > > -                 worker->wi_cost_limit, worker->wi_cost_limit_base,
> > > > -                 worker->wi_cost_delay);
> > > >
> > > > I think it's better to keep this kind of log in some form for
> > > > debugging. For example, we can show these values of autovacuum workers
> > > > in VacuumUpdateCosts().
> > >
> > > I added a message to do_autovacuum() after calling VacuumUpdateCosts()
> > > in the loop vacuuming each table. That means it will happen once per
> > > table. It's not ideal that I had to move the call to VacuumUpdateCosts()
> > > behind the shared lock in that loop so that we could access the pid and
> > > such in the logging message after updating the cost and delay, but it is
> > > probably okay. Though noone is going to be changing those at this
> > > point, it still seemed better to access them under the lock.
> > >
> > > This does mean we won't log anything when we do change the values of
> > > VacuumCostDelay and VacuumCostLimit while vacuuming a table. Is it worth
> > > adding some code to do that in VacuumUpdateCosts() (only when the value
> > > has changed not on every call to VacuumUpdateCosts())? Or perhaps we
> > > could add it in the config reload branch that is already in
> > > vacuum_delay_point()?
> >
> > Previously, we used to show the pid in the log since a worker/launcher
> > set other workers' delay costs. But now that the worker sets its delay
> > costs, we don't need to show the pid in the log. Also, I think it's
> > useful for debugging and investigating the system if we log it when
> > changing the values. The log I imagined to add was like:
> >
> > @@ -1801,6 +1801,13 @@ VacuumUpdateCosts(void)
> >          VacuumCostDelay = vacuum_cost_delay;
> >
> >          AutoVacuumUpdateLimit();
> > +
> > +        elog(DEBUG2, "autovacuum update costs (db=%u, rel=%u, dobalance=%s, cost_limit=%d, cost_delay=%g active=%s failsafe=%s)",
> > +             MyWorkerInfo->wi_dboid, MyWorkerInfo->wi_tableoid,
> > +             pg_atomic_unlocked_test_flag(&MyWorkerInfo->wi_dobalance) ? "no" : "yes",
> > +             VacuumCostLimit, VacuumCostDelay,
> > +             VacuumCostDelay > 0 ? "yes" : "no",
> > +             VacuumFailsafeActive ? "yes" : "no");
> >      }
> >      else
> >      {
>
> Makes sense. I've updated the log message to roughly what you suggested.
> I also realized I think it does make sense to call it in
> VacuumUpdateCosts() -- only for autovacuum workers of course. I've done
> this. I haven't taken the lock though and can't decide if I must since
> they access dboid and tableoid -- those are not going to change at this
> point, but I still don't know if I can access them lock-free...
> Perhaps there is a way to condition it on the log level?
>
> If I have to take a lock, then I don't know if we should put these in
> VacuumUpdateCosts()...

I think we don't need to acquire a lock there, since both values are
updated only by the worker reporting this message. I also agree with
where to put the log, but I think the log message should start with a
lowercase letter:

+        elog(DEBUG2,
+             "autovacuum VacuumUpdateCosts(db=%u, rel=%u, dobalance=%s, cost_limit=%d, cost_delay=%g active=%s failsafe=%s)",
+             MyWorkerInfo->wi_dboid, MyWorkerInfo->wi_tableoid,
+             pg_atomic_unlocked_test_flag(&MyWorkerInfo->wi_dobalance) ? "no" : "yes",
+             VacuumCostLimit, VacuumCostDelay,
+             VacuumCostDelay > 0 ? "yes" : "no",
+             VacuumFailsafeActive ? "yes" : "no");

Some minor comments on 0003:

+/*
+ * autovac_recalculate_workers_for_balance
+ *     Recalculate the number of workers to consider, given cost-related
+ *     storage parameters and the current number of active workers.
+ *
+ * Caller must hold the AutovacuumLock in at least shared mode to access
+ * worker->wi_proc.
+ */

Does it make sense to add Assert(LWLockHeldByMe(AutovacuumLock)) at
the beginning of this function?

---
/* rebalance in case the default cost parameters changed */
- LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
- autovac_balance_cost();
+ LWLockAcquire(AutovacuumLock, LW_SHARED);
+ autovac_recalculate_workers_for_balance();
LWLockRelease(AutovacuumLock);

Do we really need to have the autovacuum launcher recalculate
av_nworkersForBalance after reloading the config file? Since the cost
parameters are not used in autovac_recalculate_workers_for_balance(),
the comment also needs to be updated.

---
+            /*
+             * Balance and update limit values for autovacuum workers.  We
+             * must always do this in case the autovacuum launcher or another
+             * autovacuum worker has recalculated the number of workers
+             * across which we must balance the limit.  This is done by the
+             * launcher when launching a new worker and by workers before
+             * vacuuming each table.
+             */
+            AutoVacuumUpdateCostLimit();

I think the last sentence is not correct. IIUC, recalculation of
av_nworkersForBalance is done by the launcher after a worker finishes
and by workers before vacuuming each table.

---
It's not a problem of this patch, but IIUC, since we don't reset
wi_dobalance after vacuuming each table, we use the last table's value
of wi_dobalance when processing autovacuum work items. At the end of
the loop over tables in do_autovacuum(), we have the following code,
whose comment explains why we don't reset wi_dobalance:

/*
* Remove my info from shared memory. We could, but intentionally
* don't, unset wi_dobalance on the assumption that we are more likely
* than not to vacuum a table with no cost-related storage parameters
* next, so we don't want to give up our share of I/O for a very short
* interval and thereby thrash the global balance.
*/
LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
MyWorkerInfo->wi_tableoid = InvalidOid;
MyWorkerInfo->wi_sharedrel = false;
LWLockRelease(AutovacuumScheduleLock);

Assuming we agree with that, we probably need to reset it to true
after vacuuming all tables?

The 0001 and 0002 patches look good to me, except for the GUC-renaming
part, since that discussion is ongoing.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
