Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2020-09-14 01:27:56
Message-ID: 929224.1600046876@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Probably this requires a relcache inval at the wrong time;
> although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
> so that can't be the whole triggering condition. I wonder whether
> it is relevant that all of the complaining animals are JIT-enabled.

Hmmm ... I take that back. hyrax has indeed passed since this went
in, but *it doesn't run any TAP tests*. So the buildfarm offers no
information about whether the replication tests work under
CLOBBER_CACHE_ALWAYS.

Realizing that, I built an installation that way and tried to run
the subscription tests. Results so far:

* Running 010_truncate.pl by itself passed for me. So there's still
some unexplained factor needed to trigger the buildfarm failures.
(I'm wondering about concurrent autovacuum activity now...)

* Starting over, it appears that 001_rep_changes.pl almost immediately
gets into an infinite loop. It does not complete the third test step,
rather infinitely waiting for progress to be made. The publisher log
shows a repeating loop like

2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG: could not send data to client: Broken pipe
2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the commit callback, associated LSN 0/1660628
2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: starting logical decoding for slot "tap_sub"
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: Streaming transactions committing after 0/1652820, reading WAL from 0/1651B20.
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: logical decoding found consistent point at 0/1651B20
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: There are no running transactions.
2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:08.618 EDT [928641] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:09.757 EDT [928645] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:09.757 EDT [928645] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.841 EDT [928582] tap_sub LOG: could not send data to client: Broken pipe
2020-09-13 21:17:09.841 EDT [928582] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the commit callback, associated LSN 0/1660628

while the subscriber is repeating

2020-09-13 21:15:01.598 EDT [928528] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:02.178 EDT [928528] ERROR: terminating logical replication worker due to timeout
2020-09-13 21:16:02.179 EDT [920797] LOG: background worker "logical replication worker" (PID 928528) exited with exit code 1
2020-09-13 21:16:02.606 EDT [928571] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:03.117 EDT [928571] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:03.118 EDT [920797] LOG: background worker "logical replication worker" (PID 928571) exited with exit code 1
2020-09-13 21:16:03.544 EDT [928574] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.053 EDT [928574] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:04.054 EDT [920797] LOG: background worker "logical replication worker" (PID 928574) exited with exit code 1
2020-09-13 21:16:04.479 EDT [928576] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.990 EDT [928576] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:04.990 EDT [920797] LOG: background worker "logical replication worker" (PID 928576) exited with exit code 1
2020-09-13 21:16:05.415 EDT [928579] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:17:05.994 EDT [928579] ERROR: terminating logical replication worker due to timeout

I'm out of patience to investigate this for tonight, but there is
something extremely broken here; maybe more than one something.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-09-14 01:45:49 Re: Avoid incorrect allocation in buildIndexArray
Previous Message Tom Lane 2020-09-13 23:00:26 Re: Gripes about walsender command processing