Re: Auto-vacuum timing out and preventing connections

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: David Johansen <davejohansen(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Auto-vacuum timing out and preventing connections
Date: 2022-06-30 02:22:35
Message-ID: CAD21AoDL6eAY7+JY6wJ3-N2DY04GCXtWZ+tZNF9OYJeUnPkQEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Jun 29, 2022 at 5:05 AM David Johansen <davejohansen(at)gmail(dot)com> wrote:
>
> On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen(at)gmail(dot)com> wrote:
>>>
>>> We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:
>>> WARNING: worker took too long to start; canceled
>>> The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.
>>
>>
>> As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?
>
>
> That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible.
>
>>
>> What version are you using?
>
>
> 13.6
>
>>>
>>> These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problem
>>> https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org
>>> https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
>>
>>
>> But, I don't see high CPU described as a symptom in either of those threads.
>
>
> I was referring to the "I've seen this happen under heavy load" statement. Not sure that's the cause or related in those posts, but it doesn't appear to be the issue here.
>
>>
>> If you can't reproduce the problem locally, there probably isn't much we can do. Maybe ask Amazon to look into it, since they are the only ones with sufficient access to do so.
>
>
> We've opened a support case, but I was trying to be proactive and seeing what we could dig into on our end. Is there a way to tell which table the auto-vacuum is trying to run on and timing out with?

Autovacuum workers launch per database. The situation where the
warning "worker took too long to start; canceled" occurs is that an
autovacuum worker for the particular database took a long time for its
startup phase (initializing and setting parameters etc.). There is no
way to know neither which table nor which database.

If it's reproducible, it may help investigate it if you could collect
the contents of pg_stat_activity when the issue is happening in order
to see if there is another process waiting.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message 396934406@qq.com 2022-07-01 03:35:15 pg15 beta2 bug:cause by logcial replation
Previous Message Andrey Borodin 2022-06-29 18:43:00 Re: pg_upgrade (12->14) fails on aggregate