Re: [HACKERS] Restricting maximum keep segments by repslots

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-08 05:19:56
Message-ID: 20200408.141956.891237856186513376.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > I pushed version 26, with a few further adjustments.
> >
> > I think what we have now is sufficient, but if you want to attempt this
> > "invalidated" flag on top of what I pushed, be my guest.
>
> I don't think the invalidation flag is essential but it can prevent
> unanticipated behavior, in other words, it makes us feel at ease:p
>
> After the current master/HEAD, the following steps causes assertion
> failure in xlogreader.c.
..
> I will look at it.

Just avoiding starting replication when restart_lsn is invalid is
sufficient (the attached, which is equivalent to a part of what the
invalidated flag did). I think the error message needs a Hint, but
with one it looks like this on the subscriber side:

[22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated
HINT: The slot exceeds the limit by max_slot_wal_keep_size.

I don't think that is clean. Perhaps the subscriber should remove
the trailing line of the message from the publisher?
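
For illustration only, here is a minimal standalone sketch of the kind of
check described above: refuse to start replication when the slot's
restart_lsn is invalid. It is not the attached patch. DemoSlot and
demo_can_start_replication() are hypothetical names, and XLogRecPtr and
its macros are simplified stand-ins so that the snippet compiles outside
the PostgreSQL tree.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the PostgreSQL definitions (assumption). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

/* Hypothetical, stripped-down slot: only the fields the check needs. */
typedef struct DemoSlot
{
    const char *name;
    XLogRecPtr  restart_lsn;
} DemoSlot;

/*
 * Return true if replication may start from this slot.  Otherwise fill
 * errbuf with a message modelled on the one shown above and return false.
 */
static bool
demo_can_start_replication(const DemoSlot *slot, char *errbuf, size_t errlen)
{
    if (XLogRecPtrIsInvalid(slot->restart_lsn))
    {
        snprintf(errbuf, errlen,
                 "replication slot \"%s\" is invalidated", slot->name);
        return false;
    }
    return true;
}
```

In the server the failing branch would raise an ERROR rather than return
false; the return value here only keeps the sketch self-contained.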

> On the other hand, physical replication doesn't break by invalidation.
>
> Primary: postgres.conf
> max_slot_wal_keep_size=0
> Standby: postgres.conf
> primary_conninfo='connect to master'
> primary_slot_name='x1'
>
> (start the primary)
> P=> select pg_create_physical_replication_slot('x1');
> (start the standby)
> S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

If we don't mind that the standby can reconnect after its walsender is
terminated due to the invalidation, we don't need to do anything about
this. Requiring max_slot_wal_keep_size to be larger than a certain
threshold would reduce the chance of seeing that behavior.

I saw another issue: the following sequence on the primary freezes
when invalidation happens.

=# create table tt(); drop table tt; select pg_switch_wal();create table tt(); drop table tt; select pg_switch_wal();create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

The last checkpoint command is waiting on the condition variable
CheckpointerShmem->start_cv in RequestCheckpoint(), while the
checkpointer is waiting for the next latch at the end of
CheckpointerMain. new_started doesn't advance; it stays at the same
value as old_started.

That freeze didn't happen when I removed
ConditionVariableSleep(&s->active_cv) in
InvalidateObsoleteReplicationSlots.

I will continue investigating it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-walsender-crash-fix.patch text/x-patch 1.0 KB
