Re: [HACKERS] More stats about skipped vacuums

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: sawada(dot)mshk(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] More stats about skipped vacuums
Date: 2017-11-15 07:13:01
Message-ID: CAB7nPqQm_WCKuUf5RD0CzeMuMO907ZPKP7mBh-3t2zSJ9jn+PA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 30, 2017 at 8:57 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Thu, 26 Oct 2017 15:06:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20171026(dot)150630(dot)115694437(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
>> At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote in <CAD21AoAkaw-u0feAVN_VrKZA5tvzp7jT=mQCQP-SvMegKXHHaw(at)mail(dot)gmail(dot)com>
>> > > n_mod_since_analyze | 20000
>> > > + vacuum_requred | true
>> > > + last_vacuum_oldest_xid | 8023
>> > > + last_vacuum_left_to_truncate | 5123
>> > > + last_vacuum_truncated | 387
>> > > last_vacuum | 2017-10-10 17:21:54.380805+09
>> > > last_autovacuum | 2017-10-10 17:21:54.380805+09
>> > > + last_autovacuum_status | Killed by lock conflict
>> > > ...
>> > > autovacuum_count | 128
>> > > + incomplete_autovacuum_count | 53
>> > >
>> > > # The last one might be needless..
>> >
>> > I'm not sure that the above informations will help for users or DBA
>> > but personally I sometimes want to have the number of index scans of
>> > the last autovacuum in the pg_stat_user_tables view. That value
>> > indicates how efficiently vacuums performed and would be a signal to
>> > increase the setting of autovacuum_work_mem for user.
>>
>> Btree and all existing index AMs (except brin) seem to visit the
>> all pages in every index scan so it would be valuable. Instead
>> the number of visited index pages during a table scan might be
>> usable. It is more relevant to performance than the number of
>> scans, on the other hand it is a bit difficult to get something
>> worth from the number in a moment. I'll show the number of scans
>> in the first cut.
>>
>> > > Where the "Killed by lock conflict" is one of the followings.
>> > >
>> > > - Completed
>> > > - Truncation skipped
>> > > - Partially truncated
>> > > - Skipped
>> > > - Killed by lock conflict
>> > >
>> > > This seems enough to find the cause of a table bloat. The same
>> > > discussion could be applied to analyze but it might be the
>> > > another issue.
>> > >
>> > > There may be a better way to indicate the vacuum soundness. Any
>> > > opinions and suggestions are welcome.
>> > >
>> > > I'm going to make a patch to do the 'formal' one for the time
>> > > being.
>
> Done with small modifications. In the attached patch
> pg_stat_all_tables has the following new columns. Documentations
> is not provided at this stage.
>
> -----
> n_mod_since_analyze | 0
> + vacuum_required | not requried
> last_vacuum |
> last_autovacuum | 2017-10-30 18:51:32.060551+09
> last_analyze |
> last_autoanalyze | 2017-10-30 18:48:33.414711+09
> vacuum_count | 0
> + last_vacuum_truncated | 0
> + last_vacuum_untruncated | 0
> + last_vacuum_index_scans | 0
> + last_vacuum_oldest_xmin | 2134
> + last_vacuum_status | agressive vacuum completed
> + autovacuum_fail_count | 0
> autovacuum_count | 5
> analyze_count | 0
> autoanalyze_count | 1
> -----
> Where each column shows the following infomation.
>
> + vacuum_required | not requried
>
> VACUUM requirement status. Takes the following values.
>
> - partial
> Partial (or normal) will be performed by the next autovacuum.
> The word "partial" is taken from the comment for
> vacuum_set_xid_limits.
>
> - aggressive
> Aggressive scan will be performed by the next autovacuum.
>
> - required
> Any type of autovacuum will be performed. The type of scan is
> unknown because the view failed to take the required lock on
> the table. (AutoVacuumrequirement())
>
> - not required
> Next autovacuum won't perform scan on this relation.
>
> - not required (lock not acquired)
>
> Autovacuum should be disabled and the distance to
> freeze-limit is not known because required lock is not
> available.
>
> - close to freeze-limit xid
> Shown while autovacuum is disabled. The table is in the
> manual vacuum window to avoid anti-wraparound autovacuum.
>
> + last_vacuum_truncated | 0
>
> The number of truncated pages in the last completed
> (auto)vacuum.
>
> + last_vacuum_untruncated | 0
> The number of pages the last completed (auto)vacuum tried to
> truncate but could not for some reason.
>
> + last_vacuum_index_scans | 0
> The number of index scans performed in the last completed
> (auto)vacuum.
>
> + last_vacuum_oldest_xmin | 2134
> The oldest xmin used in the last completed (auto)vacuum.
>
> + last_vacuum_status | agressive vacuum completed
>
> The finish status of the last vacuum. Takes the following
> values. (pg_stat_get_last_vacuum_status())
>
> - completed
> The last partial (auto)vacuum is completed.
>
> - vacuum full completed
> The last VACUUM FULL is completed.
>
> - aggressive vacuum completed
> The last aggressive (auto)vacuum is completed.
>
> - error while $progress
> The last vacuum stopped by error while $progress.
> The $progress one of the vacuum progress phases.
>
> - canceled while $progress
> The last vacuum was canceled while $progress
>
> This is caused by user cancellation of manual vacuum or
> killed by another backend who wants to acquire lock on the
> relation.
>
> - skipped - lock unavailable
> The last autovacuum on the relation was skipped because
> required lock was not available.
>
> - unvacuumable
> A past autovacuum tried vacuum on the relation but it is not
> vacuumable for reasons of ownership or accessibility problem.
> (Such relations are not shown in pg_stat_all_tables..)
>
> + autovacuum_fail_count | 0
> The number of successive failure of vacuum on the relation.
> Reset to zero by completed vacuum.
>
> ======
>
> In the patch, vacrelstats if pointed from a static variable and
> cancel reporting is performed in PG_CATCH() section in vacuum().
> Every unthrown error like lock acquisition failure is reported by
> explicit pgstat_report_vacuum() with the corresponding finish
> code.
>
> Vacuum requirement status is calculated in AutoVacuumRequirment()
> and returned as a string. Access share lock on the target
> relation is required but it returns only available values if the
> lock is not available. I decided to return incomplete (coarse
> grained) result than wait for a lock that isn't known to be
> relased in a short time for a perfect result.

pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
+ pg_stat_get_last_vacuum_truncated(C.oid) AS last_vacuum_truncated,
+ pg_stat_get_last_vacuum_untruncated(C.oid) AS
last_vacuum_untruncated,
+ pg_stat_get_last_vacuum_index_scans(C.oid) AS
last_vacuum_index_scans,
+ pg_stat_get_last_vacuum_oldest_xmin(C.oid) AS
last_vacuum_oldest_xmin,
+ pg_stat_get_last_vacuum_status(C.oid) AS last_vacuum_status,
+ pg_stat_get_autovacuum_fail_count(C.oid) AS autovacuum_fail_count,
Please use spaces instead of tabs. Indentation is not consistent.

+ case PGSTAT_VACUUM_CANCELED:
+ phase = tabentry->vacuum_last_phase;
+ /* number of elements of phasestr above */
+ if (phase >= 0 && phase <= 7)
+ result = psprintf("%s while %s",
+ status == PGSTAT_VACUUM_CANCELED ?
+ "canceled" : "error",
+ phasestr[phase]);
Such complication is not necessary. The phase parameter is updated by
individual calls of pgstat_progress_update_param(), so the information
showed here overlaps with the existing information in the "phase"
field.

@@ -210,7 +361,6 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
PG_RETURN_INT64(result);
}

-
Datum
pg_stat_get_blocks_hit(PG_FUNCTION_ARGS)
Noise diff.

Thinking about trying to et something into core by the end of the
commit fest, this patch presents multiple concepts at once which could
be split into separate patches for simplicity:
1) Additional data fields to help in debugging completed vacuums.
2) Tracking of interrupted vacuum jobs in progress table.
3) Get state of vacuum job on error.

However, progress reports are here to allow users to do decisions
based on the activity of how things are working. This patch proposes
to add multiple new fields:
- oldest Xmin.
- number of index scans.
- number of pages truncated.
- number of pages that should have been truncated, but are not truncated.
Among all this information, as Sawada-san has already mentioned
upthread, the more index scans the less dead tuples you can store at
once, so autovacuum_work_mem ought to be increases. This is useful for
tuning and should be documented properly if reported to give
indications about vacuum behavior. The rest though, could indicate how
aggressive autovacuum is able to remove tail blocks and do its work.
But what really matters for users to decide if autovacuum should be
more aggressive is tracking the number of dead tuples, something which
is already evaluated.

Tracking the number of failed vacuum attempts is also something
helpful to understand how much the job is able to complete. As there
is already tracking vacuum jobs that have completed, it could be
possible, instead of logging activity when a vacuum job has failed, to
track the number of *begun* jobs on a relation. Then it is possible to
guess how many have failed by taking the difference between those that
completed properly. Having counters per failure types could also be a
possibility.

For this commit fest, I would suggest a patch that simply adds
tracking for the number of index scans done, with documentation to
give recommendations about parameter tuning. i am switching the patch
as "waiting on author".
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-11-15 08:08:42 Re: [HACKERS] [PATCH] Add recovery_min_apply_delay_reconnect recovery option
Previous Message Beena Emerson 2017-11-15 06:53:42 Re: [HACKERS] Runtime Partition Pruning