Re: walsender vs. XLogBackgroundFlush during shutdown

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: walsender vs. XLogBackgroundFlush during shutdown
Date: 2019-05-02 12:35:45
Message-ID: 20190502123545.bxq5c6ziefdcjewu@development
Lists: pgsql-hackers

On Wed, May 01, 2019 at 07:12:52PM +0200, Alexander Kukushkin wrote:
>Hi,
>
>On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> OK, so that seems like a separate issue, somewhat unrelated to the issue I
>> reported. And I'm not sure it's a walsender issue - it seems it might be a
>> psycopg2 issue in not reporting the flush properly, no?
>
>Agreed, it is a different issue, but I am unsure which to blame,
>postgres or psycopg2.
>Right now psycopg2 confirms more or less every XLogData message, but
>at the same time the LSN on the primary keeps moving forward, and we
>learn about that through keepalive messages.
>I perfectly understand the need to periodically send updates of flush
>= walEnd (which comes from the keepalive): there may be no transaction
>activity while WAL is still being generated, and without such updates
>the replication slot would prevent that WAL from being cleaned up.
>From the client's perspective it has confirmed everything it should
>have, but from the postgres side that is not enough to shut down
>cleanly. Maybe it is possible to change the check (sentPtr ==
>replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr),
>or would that be unsafe?
>

I don't know.

In general I think it's a bit strange that we're waiting for walsender
processes to catch up even in fast shutdown mode, instead of just aborting
them like other backends. But I assume there are reasons for that. OTOH it
makes us vulnerable to issues like this, when a (presumably) misbehaving
downstream prevents a shutdown.
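To make the comparison concrete, here's a rough model of the two checks
being discussed (a sketch only — variable names are illustrative, not
the actual walsender.c symbols, and LSNs are plain integers here):

```python
# Hypothetical model of the walsender shutdown-readiness check.
# Names are illustrative; they are not PostgreSQL's actual variables.

def ready_to_shutdown_current(sent_ptr: int, replicated_ptr: int) -> bool:
    """Current behavior: every byte sent must be confirmed flushed,
    including keepalives that only advance walEnd."""
    return sent_ptr == replicated_ptr

def ready_to_shutdown_relaxed(last_msg_sent_ptr: int, replicated_ptr: int) -> bool:
    """Proposed relaxation: the client has confirmed at least up to
    the start of the last message we sent."""
    return last_msg_sent_ptr <= replicated_ptr

# A client that confirmed all XLogData but not a trailing keepalive:
sent_ptr = 0x1_0000_0100          # includes a keepalive advancing walEnd
replicated_ptr = 0x1_0000_0000    # flush LSN the client last reported
last_msg_sent_ptr = 0x1_0000_0000

assert not ready_to_shutdown_current(sent_ptr, replicated_ptr)
assert ready_to_shutdown_relaxed(last_msg_sent_ptr, replicated_ptr)
```

Under the strict check the shutdown stalls in exactly the situation
Alexander describes; whether the relaxed form is safe for logical
replication slots is the open question.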

>> >ptr=0x55cb958dca08, len=94) at be-secure.c:318
>> >#2 0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
>> >ptr=0x55cb958dca08, len=94) at be-secure.c:265
>> >#3 0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
>> >#4 0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
>> >#5 0x000055cb93dac30b in send_message_to_frontend
>> >(edata=0x55cb942b4380 <errordata>) at elog.c:3317
>> >#6 0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
>> >#7 0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
>> >#8 0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
>> >"sending replication keepalive") at elog.c:1376
>> >#9 0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
>> >walsender.c:3358
>>
>> Is it stuck in the send() call forever, or did you just happen to grab
>> this backtrace?
>
>No, it didn't get stuck there. During shutdown postgres starts sending
>a few thousand keepalive messages per second and receives as many
>feedback messages back, so the chances of interrupting it somewhere in
>send() are quite high.

Uh, that seems a bit broken, perhaps?
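One obvious mitigation would be to rate-limit the keepalives during
shutdown instead of sending one per feedback message. A rough sketch of
that idea (purely illustrative, not the actual walsender logic):

```python
class KeepaliveThrottle:
    """Illustrative rate limiter: allow at most one keepalive per
    min_interval_s, regardless of how often feedback arrives."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self.last_sent = float("-inf")

    def should_send(self, now: float) -> bool:
        # Send only if the minimum interval has elapsed since the
        # previous keepalive; otherwise suppress this one.
        if now - self.last_sent >= self.min_interval_s:
            self.last_sent = now
            return True
        return False

throttle = KeepaliveThrottle(min_interval_s=1.0)
# Simulate 5 seconds of feedback arriving at 1 kHz:
sent = sum(throttle.should_send(t * 0.001) for t in range(5000))
assert sent == 5  # one keepalive per second, not thousands
```

That would keep the shutdown handshake responsive without the
keepalive/feedback storm described above.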

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
