Quick Links

Re: Server Crash while executing pg_replication_slot_advance (second time)

From:	Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To:	tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Server Crash while executing pg_replication_slot_advance (second time)
Date:	2018-02-17 02:26:20
Message-ID:	873720e4hf.fsf@ars-thinkpad
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello,

I confirm this bug. The idea is that while usually we start decoding
from safe data.restart_lsn point, here we don't care about consistent
snapshots and rush into decoding right away from data.confirmed_flush
(slotfunc.c:475). The latter points to the first page's header instead
of valid record if we log SWITCH record and confirm it without any
additional records (while doing this manually you'd better hurry up to
outrun xl_running_xacts slipping through). This can be reproduced with a
bit simpler sample:

SELECT * FROM pg_create_logical_replication_slot(
'regression_slot1', 'test_decoding', true);
select pg_switch_wal();
-- ensure we are on clean segment and xl_running_xacts didn't slip yet
select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
0/2000000
(1 row)

-- set confirmed_flush_lsn to segment start and verify that
select pg_replication_slot_advance('regression_slot1', '0/6000000');
select slot_name, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn
from pg_replication_slots;
slot_name | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------------+------+--------------+-------------+---------------------
regression_slot1 | | 557 | 0/15D0EE8 | 0/2000000
(1 row)

-- insert some wal to advance GetFlushRecPtr
create table t (i int);
-- now start decoding from start of segment, boom
select pg_replication_slot_advance('regression_slot1', '0/6000000');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
: @:!>
: @:!>
: @:!> \q

Simple way to fix that is to start decoding traditionally from
restart_lsn which certainly points to valid record. Attached patch shows
the idea. However, I am not sure this is a right decision: decoding from
restart_lsn is slower since it lags behind confirmed_lsn; and we really
don't care about consistent snapshots in this function. Well, it would
be good if we assemble one on our way, serialize it and advance
restart_lsn -- and this is AFAIU the main reason we ever decode
something at all instead of just calling LogicalConfirmReceivedLocation
-- but that's not the main goal of the function.

Another approach is to notice pointer to page start and replace it
with ptr to first record, but that doesn't sound elegantly.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
0001-Start-decoding-from-restart_lsn-which-definitely-exi.patch	text/x-diff	980 bytes

In response to

Server Crash while executing pg_replication_slot_advance (second time) at 2018-02-16 10:34:56 from tushar

Responses

Re: Server Crash while executing pg_replication_slot_advance (second time) at 2018-02-19 05:50:48 from Masahiko Sawada

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2018-02-17 02:47:22	Re: [HACKERS] why not parallel seq scan for slow functions
Previous Message	Andres Freund	2018-02-17 00:03:37	Re: pgsql: Do execGrouping.c via expression eval machinery, take two.