Re: We need to log aborted autovacuums

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-05 06:55:02
Message-ID: 4D2415C6.6090407@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Josh Berkus wrote:
> I've been trying to diagnose in a production database why certain tables
> never get autovacuumed despite having a substantial % of updates. The
> obvious reason is locks blocking autovacuum from vacuuming the table ...
>

Missed this dicussion when it popped up but have plenty to say about it
now. What I do here is look for such anomolies using
pg_stat_user_tables, that the dead rows number has exceeded the vacuum
threshold. That comparison is logged in the code at DEBUG3:

elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f
(threshold %.0f)",
NameStr(classForm->relname),
vactuples, vacthresh, anltuples, anlthresh);

But a rough computation isn't too hard to derive in a report, so long as
you haven't customized per-table parameters. As you suggested in your
message, the really bad cases here stick out a whole lot. If you pay
the slightest amount of attention to the dead row percentages they jump
right out at you. This all works easily on any version back to 8.3.
Not having as much relevant data stored in pg_stat_user_tables makes the
problem cases less obvious to spot in older versions.

If I start seeing these badly maintained tables and suspect locking is
getting in the way, I then dump traces from pg_locks+pg_stat_activity
often enough that I can estimate how often someone has an interfering
lock and what they're doing.

Should the log level on this message go up from DEBUG3? I could see
rewriting it so that it logs at DEBUG1 instead when
Log_autovacuum_min_duration is set *and* when the trigger threshold is
crossed, and at DEBUG3 the rest of the time. Given you can derive this
with a bit of work in userland, I don't see this even being justified as
an INFO or LOG level message. Anytime I can script a SQL-level monitor
for something that's easy to tie into Nagios or something, I greatly
prefer that to log file scraping for it anyway.

> What I'd like to do is add some logging code to autovacuum.c so that if
> log_autovacuum is any value other than -1, failure to vacuum due to
> locks gets logged. Does this make sense?
>

The general idea is interesting and probably more productive for the
situation you theorize is happening then messing with the logging
discussed above. But that's not where the code needs to go--the lock
isn't opened until much further down the function call stack. Attached
quickie and only tested for compilation patch probably does what you
want here. Since this would eliminate the messy follow-up step I
sometimes have gone through, dumping pg_locks data to confirm or rule
out locking issues messing with AV processing, I can see some potential
that it may have simplified situations I've ran into in the past. And
it's not out of line with the logging verbosity of similar failure mode
tests that follow it. Right now failure to acquire a lock is just not
considered a log-worthy issue, and I agree that it's worth considering
whether it should be.

If you could gather more info on whether this logging catches the
problem cases you're seeing, that would really be the right test for the
patch's usefulness. I'd give you solid 50/50 odds that you've correctly
diagnosed the issue, and knowing for sure would make advocating for this
logging a pretty easy sell to me at least.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment Content-Type Size
av-lock-failure-v1.diff text/x-patch 565 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2011-01-05 07:09:57 Re: Re: new patch of MERGE (merge_204) & a question about duplicated ctid
Previous Message Jeff Davis 2011-01-05 05:54:16 Re: WIP: Range Types