Re: Server Crash while executing pg_replication_slot_advance (second time)

From: Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To: tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server Crash while executing pg_replication_slot_advance (second time)
Date: 2018-02-17 02:26:20
Message-ID: 873720e4hf.fsf@ars-thinkpad
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello,

I confirm this bug. The idea is that while usually we start decoding
from safe data.restart_lsn point, here we don't care about consistent
snapshots and rush into decoding right away from data.confirmed_flush
(slotfunc.c:475). The latter points to the first page's header instead
of valid record if we log SWITCH record and confirm it without any
additional records (while doing this manually you'd better hurry up to
outrun xl_running_xacts slipping through). This can be reproduced with a
bit simpler sample:

SELECT * FROM pg_create_logical_replication_slot(
'regression_slot1', 'test_decoding', true);
select pg_switch_wal();
-- ensure we are on clean segment and xl_running_xacts didn't slip yet
select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
0/2000000
(1 row)

-- set confirmed_flush_lsn to segment start and verify that
select pg_replication_slot_advance('regression_slot1', '0/6000000');
select slot_name, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn
from pg_replication_slots;
slot_name | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------------+------+--------------+-------------+---------------------
regression_slot1 | | 557 | 0/15D0EE8 | 0/2000000
(1 row)

-- insert some wal to advance GetFlushRecPtr
create table t (i int);
-- now start decoding from start of segment, boom
select pg_replication_slot_advance('regression_slot1', '0/6000000');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
: @:!>
: @:!>
: @:!> \q

Simple way to fix that is to start decoding traditionally from
restart_lsn which certainly points to valid record. Attached patch shows
the idea. However, I am not sure this is a right decision: decoding from
restart_lsn is slower since it lags behind confirmed_lsn; and we really
don't care about consistent snapshots in this function. Well, it would
be good if we assemble one on our way, serialize it and advance
restart_lsn -- and this is AFAIU the main reason we ever decode
something at all instead of just calling LogicalConfirmReceivedLocation
-- but that's not the main goal of the function.

Another approach is to notice pointer to page start and replace it
with ptr to first record, but that doesn't sound elegantly.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-Start-decoding-from-restart_lsn-which-definitely-exi.patch text/x-diff 980 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-02-17 02:47:22 Re: [HACKERS] why not parallel seq scan for slow functions
Previous Message Andres Freund 2018-02-17 00:03:37 Re: pgsql: Do execGrouping.c via expression eval machinery, take two.