Re: [HACKERS] Restricting maximum keep segments by repslots

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-28 08:18:15
Message-ID: 20200428.171815.1687900483771598932.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Mon, 27 Apr 2020 19:40:07 -0400, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in
> On 2020-Apr-08, Kyotaro Horiguchi wrote:
>
> > I understand how it happens.
> >
> > The latch triggered by checkpoint request by CHECKPOINT command has
> > been absorbed by ConditionVariableSleep() in
> > InvalidateObsoleteReplicationSlots. The attached allows checkpointer
> > use MyLatch for other than checkpoint request while a checkpoint is
> > running.
>
> Hmm, that explanation makes sense, but I couldn't reproduce it with the
> steps you provided. Perhaps I'm missing something.

Sorry for the incomplete reproducer. Another checkpoint needs to be
running at the same time for the manual checkpoint to hang. The
following is the complete sequence.

1. Build a primary database cluster with the following setup, then start it.
max_slot_wal_keep_size=0
max_wal_size=32MB
min_wal_size=32MB

2. Build a replica from the primary, creating a slot, then start it.

$ pg_basebackup -R -C -S s1 -D...

3. Run the following commands. Repeat several times if they complete without hanging.
=# create table tt(); drop table tt; select pg_switch_wal();checkpoint;

The hang is evidently timing-dependent, but these steps reproduce it
quite reliably for me.

> Anyway I think this patch should fix it also -- instead of adding a new
> flag, we just rely on the existing flags (since do_checkpoint must have
> been set correctly from the flags earlier in that block.)

Since the added (!do_checkpoint) check is reached with
do_checkpoint=false at server start and at archive_timeout intervals,
the patch makes the checkpointer run a busy loop at those times, and
that loop lasts until a checkpoint is actually executed.

What we need to do here is to remember that the latch has been set,
even if the latch itself gets reset before we reach WaitLatch.
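
To illustrate the point, here is a minimal toy model in C. It is not
checkpointer code and not the attached patch; ToyLatch, toy_nested_wait
and toy_wakeup_lost are hypothetical names made up for this sketch. It
only shows that a nested wait sharing the latch can consume a wakeup,
and that remembering the consumed wakeup (here by setting the latch
again) keeps the final wait from sleeping on a pending request.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a process latch: just a boolean flag. */
typedef struct { bool is_set; } ToyLatch;

static void toy_set_latch(ToyLatch *l)   { l->is_set = true; }
static void toy_reset_latch(ToyLatch *l) { l->is_set = false; }

/*
 * Toy stand-in for a nested wait that shares the latch (for example a
 * condition-variable sleep). It resets the latch and reports whether a
 * wakeup was pending, i.e. whether it absorbed one.
 */
static bool toy_nested_wait(ToyLatch *l)
{
    bool was_set = l->is_set;

    toy_reset_latch(l);
    return was_set;
}

/*
 * One pass of a simplified main loop. Returns true if the loop would go
 * to sleep in its final wait while a request is still pending, that is,
 * if the wakeup was lost.
 */
bool toy_wakeup_lost(bool remember_absorbed)
{
    ToyLatch latch = { false };
    bool     request_pending;
    bool     absorbed;

    /* A backend files a request and sets the latch. */
    request_pending = true;
    toy_set_latch(&latch);

    /* A nested wait during the in-progress work resets the latch. */
    absorbed = toy_nested_wait(&latch);
    assert(absorbed);

    /* Remedy: do not forget the wakeup that the nested wait consumed. */
    if (remember_absorbed && absorbed)
        toy_set_latch(&latch);

    /* The final wait sleeps unless the latch is set. */
    return request_pending && !latch.is_set;
}
```

In this model toy_wakeup_lost(false) reports a lost wakeup, while
toy_wakeup_lost(true) does not.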

> I think it'd be worth to verify this bugfix in a new test. Would you
> have time to produce that? I could try in a couple of days ...

The attached patch for 019_replslot_limit.pl runs the commands above
automatically. The test sometimes succeeds, but it fails in most
cases, at least for me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment: TAP_checkpoint_freeze.patch (text/x-patch, 1.6 KB)
