|From:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|Subject:||Gripes about walsender command processing|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
While trying to understand a recent buildfarm failure , I got annoyed
by the fact that a walsender exposes its last SQL command in
pg_stat_activity even when that was a long time ago and what it's been
doing since then is replication commands. This leads to *totally*
misleading reports, for instance in  we read
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:4] LOG: server process (PID 2368853) was terminated by signal 11: Segmentation fault
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:5] DETAIL: Failed process was running: SELECT pg_catalog.set_config('search_path', '', false);
even though the process's own log messages paint a much better picture of
2020-09-13 19:10:07.302 CEST [5f5e526f.242555:1] LOG: connection received: host=[local]
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:2] LOG: replication connection authorized: user=bf application_name=sub2
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:3] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 19:10:07.334 CEST [5f5e526f.242555:4] LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:5] LOG: received replication command: START_REPLICATION SLOT "sub2" LOGICAL 0/0 (proto_version '2', publication_names '"pub2"')
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:6] LOG: starting logical decoding for slot "sub2"
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:7] DETAIL: Streaming transactions committing after 0/159EB38, reading WAL from 0/159EB00.
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:8] LOG: logical decoding found consistent point at 0/159EB00
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:9] DETAIL: There are no running transactions.
I think that walsenders ought to report their current replication command
as though it were a SQL command. They oughta set debug_query_string, too.
While trying to fix this, I also observed that exec_replication_command
fails to clean up the temp context it made for parsing the command string,
if that turns out to be a SQL command. This very accidentally fails to
lead to a long-term memory leak, because that context will be a child of
MessageContext so it'll be cleaned out in the next iteration of
PostgresMain's main loop. But it's still unacceptably sloppy coding
IMHO, and it's closely related to a lot of other randomness in the
function; it'd be better to have a separate early-exit path for SQL
Attached find a proposed fix for these things.
What I'd really like to do is adjust pgstat_report_activity so that
callers are *required* to provide some kind of command string when
transitioning into STATE_RUNNING state, ie something like
Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL);
However, before we could do that, we'd have to clean up the rat's nest
of seemingly-inserted-with-the-aid-of-a-dartboard pgstat_report_activity
calls in replication/logical/worker.c. I couldn't make heads or tails
of what is going on there, so I did not try.
BTW, while I didn't change it here, isn't the second
SnapBuildClearExportedSnapshot call in exec_replication_command just
useless? We already did that before parsing the command.
regards, tom lane
|Next Message||Brar Piening||2020-09-13 20:25:01||Minor documentation error regarding streaming replication protocol|
|Previous Message||David G. Johnston||2020-09-13 19:36:26||Re: Probable documentation errors or improvements|