Re: autovacuum starvation

From: Jim Nasby <decibel(at)decibel(dot)org>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum starvation
Date: 2007-05-05 21:54:41
Message-ID: 8ABA01BB-D248-4135-8F7E-346A102B0E50@decibel.org
Lists: pgsql-hackers

On May 2, 2007, at 5:39 PM, Alvaro Herrera wrote:
> The recently discovered autovacuum bug made me notice something that is
> possibly critical. The current autovacuum code makes an effort not to
> leave workers in a "starting" state for too long, lest it fail to tend
> all databases needing vacuum in a timely manner.
>
> This is how the launching of workers works:
> 1) the launcher puts a pointer to a WorkerInfo entry in shared memory,
> called "the starting worker" pointer
> 2) the launcher sends a signal to the postmaster
> 3) the postmaster forks a worker
> 4) the new worker checks the starting worker pointer
> 5) the new worker resets the starting worker pointer
> 6) the new worker connects to the given database and vacuums it
>
> The problem is this: I originally added some code in the autovacuum
> launcher to check that a worker does not take "too long" to start,
> where "too long" is autovacuum_naptime seconds. If a worker does take
> that long, the launcher resets the starting worker pointer, which means
> that the newly starting worker will not see anything that needs to be
> done and will exit quickly.
>
> The problem with this is that on a high load machine, for example
> lionfish during buildfarm runs, this would cause autovacuum starvation
> for the period in which the high load is sustained. This could prove
> dangerous.
>
> The problem is that things like fork() failure cannot be communicated
> back to the launcher. So when the postmaster tries to start a process
> and it fails for some reason (failure to fork, or out of memory) we need
> a way to re-initiate the worker that failed.
>
> The current code resets the starting worker pointer and leaves the slot
> free for another worker, maybe in another database, to start.
>
> I recently added code to resend the postmaster signal (step 2 above)
> when the launcher sees that the starting worker pointer is still set.
> I think this is fine, but
>
> 1) we should remove the logic that resets the starting worker pointer.
> It is not needed, because database-local failures will be handled by
> subsequent checks
>
> 2) we should keep the logic to resend the postmaster signal, but we
> should make an effort to avoid sending it too frequently
>
> Opinions?
>
> If I haven't stated the problem clearly please let me know and I'll try
> to rephrase.
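
For reference, the launch handshake and the timeout reset described in the quoted message can be modeled roughly as below. This is only a single-process sketch, not the actual autovacuum code: the struct layouts, field names, and function names (AutovacShmem, launcher_request_worker, worker_claim, launcher_check_timeout) are all hypothetical, and real signals, fork(), and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real PostgreSQL definitions. */
typedef struct WorkerInfo {
    int  dbid;      /* database the worker should vacuum */
    bool in_use;    /* set once a worker has claimed the entry */
} WorkerInfo;

typedef struct AutovacShmem {
    WorkerInfo *starting_worker;  /* "the starting worker" pointer */
    long        launch_time;      /* when the launcher set the pointer */
    int         signals_sent;     /* stands in for signalling the postmaster */
} AutovacShmem;

/* Steps 1-2: publish the request, then "signal" the postmaster. */
void launcher_request_worker(AutovacShmem *shm, WorkerInfo *wi, long now)
{
    assert(shm->starting_worker == NULL);  /* only one pending start */
    shm->starting_worker = wi;
    shm->launch_time = now;
    shm->signals_sent++;
}

/* Steps 4-5: the new worker checks and resets the pointer.
 * Returns the database to vacuum, or -1 if there is nothing to do. */
int worker_claim(AutovacShmem *shm)
{
    WorkerInfo *wi = shm->starting_worker;

    if (wi == NULL)
        return -1;              /* pointer was reset: exit quickly */
    shm->starting_worker = NULL;
    wi->in_use = true;
    return wi->dbid;
}

/* The check under discussion: drop the request if the worker has not
 * started within naptime seconds. Returns true if it was dropped. */
bool launcher_check_timeout(AutovacShmem *shm, long now, long naptime)
{
    if (shm->starting_worker != NULL && now - shm->launch_time > naptime) {
        shm->starting_worker = NULL;
        return true;
    }
    return false;
}
```

In this model, a worker that is forked late (after launcher_check_timeout has fired) gets -1 from worker_claim and exits without vacuuming anything, which is the starvation scenario on a heavily loaded machine.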

Isn't there some way to get the postmaster to signal the launcher?
Perhaps stick an error code in shared memory and then signal the launcher?
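
A rough sketch of what that could look like, purely as an illustration: the LaunchStatus struct and both function names are made up for this example, the "signal" is just a flag in a single process, and nothing here reflects how the postmaster or launcher is actually implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical shared-memory slot for reporting a failed worker start. */
typedef struct LaunchStatus {
    int  fork_errno;         /* 0 means no failure has been reported */
    bool launcher_signaled;  /* stands in for a real signal to the launcher */
} LaunchStatus;

/* Postmaster side: record why the start failed, then notify the launcher. */
void postmaster_report_failure(LaunchStatus *st, int err)
{
    assert(err != 0);
    st->fork_errno = err;
    st->launcher_signaled = true;
}

/* Launcher side: consume the notification.
 * Returns true if the worker start should be retried. */
bool launcher_handle_signal(LaunchStatus *st)
{
    if (!st->launcher_signaled)
        return false;
    st->launcher_signaled = false;

    if (st->fork_errno == 0)
        return false;        /* signaled for some other reason */
    st->fork_errno = 0;      /* clear so the report is handled only once */
    return true;
}
```

With something along these lines the launcher would learn about a failed fork() (for example EAGAIN or ENOMEM) directly, rather than inferring it from a timeout.
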
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
