From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Timeout failure in 019_replslot_limit.pl |
Date: | 2021-09-27 06:23:07 |
Message-ID: | CAA4eK1JHQEAfsxYqZDrToNiW8KAZ-bDKo-VtXQeR+nyMGF19vg@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Sep 27, 2021 at 11:32 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> > Now, in the failed run, it appears that due to some reason WAL sender
> > has not released the slot. Is it possible to see if the WAL sender is
> > still alive when a checkpoint is stuck at ConditionVariableSleep? And
> > if it is active, what is its call stack?
>
> I got again a failure today, so I have used this occasion to check that
> when the checkpoint gets stuck the WAL sender process getting SIGCONT
> is still around, waiting for a write to happen:
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
> frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
> frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
> frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
..
..
> frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12
>
> It logs its FATAL "terminating connection due to administrator
> command" coming from ProcessInterrupts(), and then it sits idle on
> ClientWrite.
>
So, it seems on your machine it has passed the following condition in
secure_write:
if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
If so, this indicates that the write failed with EWOULDBLOCK or EAGAIN,
which seems odd to me and is probably something machine-specific, or
maybe due to some different settings in your build or machine. BTW, if
SSL or GSS is enabled, that might have caused it in some way. I think
the best way forward is to debug secure_write during this occurrence.
--
With Regards,
Amit Kapila.