Re: [HACKERS] Restricting maximum keep segments by repslots

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-08 00:37:10
Message-ID: 20200408.093710.447591748588426656.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Thank you for committing this.

At Tue, 7 Apr 2020 18:45:22 -0400, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in
> On 2020-Apr-07, Kyotaro Horiguchi wrote:
>
> > > Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot?
> >
> > I did that in the attached. The 'invalidated' flag is a
> > shared-but-not-saved member of a slot, initialized to false and
> > irreversibly changed to true when the slot loses a required segment.
> >
> > It is checked by the new function CheckReplicationSlotInvalidated()
> > when acquiring a slot and when updating the slot from a standby reply
> > message. This change stops the walsender without explicitly killing
> > it, but I didn't remove that code.
>
> This change didn't work well with my proposed change to make
> checkpointer acquire slots before marking them invalid. When I
> incorporated your patch in the last version I posted yesterday, there
> was a problem that when checkpointer attempted to acquire the slot, it
> would fail with "the slot is invalidated"; also if you try to drop the
> slot, it would obviously fail. I think it would work to remove the
> SlotIsInvalidated check from the Acquire routine, and instead move it to
> the routines that need it (ie. not the InvalidateObsolete one, and also
> not the routine to drop slots).
>
> I pushed version 26, with a few further adjustments.
>
> I think what we have now is sufficient, but if you want to attempt this
> "invalidated" flag on top of what I pushed, be my guest.

I don't think the invalidation flag is essential, but it can prevent
unanticipated behavior; in other words, it puts our minds at ease :p

On the current master/HEAD, the following steps cause an assertion
failure in xlogreader.c.

P(ublisher) $ vi $PGDATA/postgresql.conf
wal_level=logical
max_slot_wal_keep_size=0
^Z
(start publisher and subscriber)

P=> create table t(a int);
P=> create publication p1 for table t;
S=> create table t(a int);
P=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;
(publisher crashes)

2020-04-08 09:20:16.893 JST [9582] LOG: invalidating slot "s1" because its restart_lsn 0/1571770 exceeds max_slot_wal_keep_size
2020-04-08 09:20:16.897 JST [9496] LOG: database system is ready to accept connections
2020-04-08 09:20:21.472 JST [9597] LOG: starting logical decoding for slot "s1"
2020-04-08 09:20:21.472 JST [9597] DETAIL: Streaming transactions committing after 0/1571770, reading WAL from 0/0.
TRAP: FailedAssertion("!XLogRecPtrIsInvalid(RecPtr)", File: "xlogreader.c", Line: 235)
postgres: walsender horiguti [local] idle(ExceptionalCondition+0xa8)[0xaac4c1]
postgres: walsender horiguti [local] idle(XLogBeginRead+0x30)[0x588dbf]
postgres: walsender horiguti [local] idle[0x8c938b]
postgres: walsender horiguti [local] idle(exec_replication_command+0x311)[0x8c9c75]
postgres: walsender horiguti [local] idle(PostgresMain+0x79a)[0x92f091]
postgres: walsender horiguti [local] idle[0x87eec3]
postgres: walsender horiguti [local] idle[0x87e69a]
postgres: walsender horiguti [local] idle[0x87abc2]
postgres: walsender horiguti [local] idle(PostmasterMain+0x11cd)[0x87a48f]
postgres: walsender horiguti [local] idle[0x7852cb]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7fc190958873]
postgres: walsender horiguti [local] idle(_start+0x2e)[0x48169e]
2020-04-08 09:20:22.255 JST [9496] LOG: server process (PID 9597) was terminated by signal 6: Aborted
2020-04-08 09:20:22.255 JST [9496] LOG: terminating any other active server processes
2020-04-08 09:20:22.256 JST [9593] WARNING: terminating connection because of crash of another server process
2020-04-08 09:20:22.256 JST [9593] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

I will look into it.

On the other hand, physical replication is not broken by invalidation.

Primary: postgresql.conf
max_slot_wal_keep_size=0
Standby: postgresql.conf
primary_conninfo='connect to master'
primary_slot_name='x1'

(start the primary)
P=> select pg_create_physical_replication_slot('x1');
(start the standby)
S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

(primary log)
2020-04-08 09:35:09.719 JST [10064] LOG: terminating walsender 10076 because replication slot "x1" is too far behind
2020-04-08 09:35:09.719 JST [10076] FATAL: terminating connection due to administrator command
2020-04-08 09:35:09.720 JST [10064] LOG: invalidating slot "x1" because its restart_lsn 0/B9F2000 exceeds max_slot_wal_keep_size
(standby)
[10075] 2020-04-08 09:35:09.723 JST FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[10101] 2020-04-08 09:35:09.734 JST LOG: started streaming WAL from primary at 0/C000000 on timeline 1

It does no harm, but something is strange. I'll look into it, too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
