Gripes about walsender command processing

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Gripes about walsender command processing
Date: 2020-09-13 19:47:51
Message-ID: 880181.1600026471@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While trying to understand a recent buildfarm failure [1], I got annoyed
by the fact that a walsender exposes its last SQL command in
pg_stat_activity even when that was a long time ago and what it's been
doing since then is replication commands. This leads to *totally*
misleading reports, for instance in [1] we read

2020-09-13 19:10:09.632 CEST [5f5e526d.242415:4] LOG: server process (PID 2368853) was terminated by signal 11: Segmentation fault
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:5] DETAIL: Failed process was running: SELECT pg_catalog.set_config('search_path', '', false);

even though the process's own log messages paint a much better picture of
reality:

2020-09-13 19:10:07.302 CEST [5f5e526f.242555:1] LOG: connection received: host=[local]
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:2] LOG: replication connection authorized: user=bf application_name=sub2
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:3] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 19:10:07.334 CEST [5f5e526f.242555:4] LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:5] LOG: received replication command: START_REPLICATION SLOT "sub2" LOGICAL 0/0 (proto_version '2', publication_names '"pub2"')
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:6] LOG: starting logical decoding for slot "sub2"
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:7] DETAIL: Streaming transactions committing after 0/159EB38, reading WAL from 0/159EB00.
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:8] LOG: logical decoding found consistent point at 0/159EB00
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:9] DETAIL: There are no running transactions.

I think that walsenders ought to report their current replication command
as though it were a SQL command. They oughta set debug_query_string, too.

While trying to fix this, I also observed that exec_replication_command
fails to clean up the temp context it made for parsing the command string,
if that turns out to be a SQL command. This very accidentally fails to
lead to a long-term memory leak, because that context will be a child of
MessageContext so it'll be cleaned out in the next iteration of
PostgresMain's main loop. But it's still unacceptably sloppy coding
IMHO, and it's closely related to a lot of other randomness in the
function; it'd be better to have a separate early-exit path for SQL
commands.

Attached find a proposed fix for these things.

What I'd really like to do is adjust pgstat_report_activity so that
callers are *required* to provide some kind of command string when
transitioning into STATE_RUNNING state, ie something like

Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL);

However, before we could do that, we'd have to clean up the rat's nest
of seemingly-inserted-with-the-aid-of-a-dartboard pgstat_report_activity
calls in replication/logical/worker.c. I couldn't make heads or tails
of what is going on there, so I did not try.

BTW, while I didn't change it here, isn't the second
SnapBuildClearExportedSnapshot call in exec_replication_command just
useless? We already did that before parsing the command.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03

Attachment Content-Type Size
clean-up-exec_replication_command.patch text/x-diff 3.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Brar Piening 2020-09-13 20:25:01 Minor documentation error regarding streaming replication protocol
Previous Message David G. Johnston 2020-09-13 19:36:26 Re: Probable documentation errors or improvements