Re: BUG #15888: Bogus "idle in transaction" state for logical decoding client after creating a slot

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: marko(at)joh(dot)to, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15888: Bogus "idle in transaction" state for logical decoding client after creating a slot
Date: 2019-07-09 20:15:57
Message-ID: 20190709201557.y5yt57z2za2hu6l7@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2019-07-04 13:35:12 +0900, Michael Paquier wrote:
> On Wed, Jul 03, 2019 at 08:03:24AM +0000, PG Bug reporting form wrote:
> > Run:
> >
> > pg_recvlogical -S foo -f /dev/null -d dbname --create-slot --start
> >
> > and pg_stat_activity will show something like this until that session
> > disconnects:
> >
> > which is quite obviously bogus.
>
> I think that's the transaction which is started when exporting the
> snapshot at logical slot creation. Or in short:
> #2 0x000055b5f8e5720f in StartTransactionCommand () at xact.c:2699
> #3 0x000055b5f90fab4e in SnapBuildExportSnapshot
> (builder=0x55b5fa3427b0) at snapbuild.c:575
> #4 0x000055b5f90fe5df in CreateReplicationSlot (cmd=0x55b5fa2914e8)
> at walsender.c:843
>
> SnapBuildExportSnapshot() needs to keep a transaction context opened
> to keep the wanted xmin around. In PG >= 10, this problem gets
> "solved" by using NOEXPORT_SNAPSHOT in the CREATE_REPLICATION_SLOT
> command. One idea would be to enforce a commit in this case from
> CreateReplicationSlot/streamutil.c? We know that pg_recvlogical is
> not going to use this snapshot anyway...

The transaction ought to be finished by exec_replication_command()
calling SnapBuildClearExportedSnapshot(), which then does
AbortCurrentTransaction(). So something's wrong if that's not working
anymore.

I just tested this on a slightly older v12 checkout, and the general
mechanism works:

S1(replication):
postgres[13464][1]=# CREATE_REPLICATION_SLOT foo LOGICAL test_decoding;

S2(plain):
postgres[13317][1]=# SELECT state, wait_event_type, wait_event, backend_xid, backend_xmin FROM pg_stat_activity WHERE pid = 13464;
┌─────────────────────┬─────────────────┬────────────┬─────────────┬──────────────┐
│ state │ wait_event_type │ wait_event │ backend_xid │ backend_xmin │
├─────────────────────┼─────────────────┼────────────┼─────────────┼──────────────┤
│ idle in transaction │ Client │ ClientRead │ (null) │ 1494339463 │
└─────────────────────┴─────────────────┴────────────┴─────────────┴──────────────┘
(1 row)

S1(replication):
postgres[13464][1]=# IDENTIFY_SYSTEM;
┌─────────────────────┬──────────┬──────────────┬──────────┐
│ systemid │ timeline │ xlogpos │ dbname │
├─────────────────────┼──────────┼──────────────┼──────────┤
│ 6704433050907079638 │ 1 │ 3AC/1A25B700 │ postgres │
└─────────────────────┴──────────┴──────────────┴──────────┘

S2(plain):
postgres[13317][1]=# SELECT state, wait_event_type, wait_event, backend_xid, backend_xmin FROM pg_stat_activity WHERE pid = 13464;
┌───────┬─────────────────┬────────────┬─────────────┬──────────────┐
│ state │ wait_event_type │ wait_event │ backend_xid │ backend_xmin │
├───────┼─────────────────┼────────────┼─────────────┼──────────────┤
│ idle │ Client │ ClientRead │ (null) │ (null) │
└───────┴─────────────────┴────────────┴─────────────┴──────────────┘
(1 row)

I *can* reproduce with 9.6. It's worthwhile to note that the
transaction *is* actually aborted, via the path above. As far as I can
tell the problem is "solely" that we never reach a set_ps_display() that
resets the 'idle in transaction' bit, because START_REPLICATION "takes
over" the connection once started, and we don't go through
PostgresMain() again.

The problem doesn't exist in later versions, because there
exec_replication_command() does an explicit
/* Report to pgstat that this process is now idle */
pgstat_report_activity(STATE_IDLE, NULL);
That's due to

commit d02974e32e028fc078d8f5eca1d6a4516efb0aa6
Author: Magnus Hagander <magnus(at)hagander(dot)net>
Date: 2017-12-29 16:19:51 +0100

Properly set base backup backends to active in pg_stat_activity

which is post 9.6. Not sure I really feel a huge appetite for whacking
this around in the back branches.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2019-07-09 20:59:25 Re: BUG #15899: Valgrind detects errors on create gist index
Previous Message Tom Lane 2019-07-09 17:07:18 Re: ALTER TABLE results in "ERROR: could not open relation with OID 43707388"