Re: [HACKERS] Restricting maximum keep segments by repslots

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-06 22:15:55
Message-ID: 20200406221555.GA16211@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020-Apr-06, Alvaro Herrera wrote:

> Lastly, I noticed that we're now changing the slot's restart_lsn to
> Invalid without being the slot's owner, which goes counter to what is
> said in slot.h:
>
> * - Individual fields are protected by mutex where only the backend owning
> * the slot is authorized to update the fields from its own slot. The
> * backend owning the slot does not need to take this lock when reading its
> * own fields, while concurrent backends not owning this slot should take the
> * lock when reading this slot's data.
>
> What this means is that if the slot owner walsender updates the
> restart_lsn to a newer value just as we (checkpointer) are trying to set
> it to Invalid, the owner's value might persist and our value would be
> lost.
>
> AFAICT if we were really stressed about getting this exactly correct,
> then we would have to kill the walsender, wait for it to die, then
> ReplicationSlotAcquire and *then* update
> MyReplicationSlot->data.restart_lsn.

So I had cold feet about the whole business of trying to write a
non-owned replication slot, so I tried to implemented the "exactly
correct" idea above. That's v25 here.

I think there's a race condition in this: if we kill a walsender and it
restarts immediately before we (checkpoint) can acquire the slot, we
will wait for it to terminate on its own. Fixing this requires changing
the ReplicationSlotAcquire API so that it knows not to wait but not
raise error either (so we can use an infinite loop: "acquire, if busy
send signal")

I also include a separate diff for a change that might or might not be
necessary, where xmins reserved by slots with restart_lsn=invalid are
ignored. I'm not yet sure that we should include this, but we should
keep an eye on it.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v25-0001-Add-WAL-relief-vent-for-replication-slots.patch text/x-diff 23.7 KB
ignore-xmin-invalidated.patch text/x-diff 947 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-04-06 22:16:37 Re: ERROR: invalid input syntax for type circle
Previous Message Tomas Vondra 2020-04-06 22:13:54 Re: [PATCH] Incremental sort (was: PoC: Partial sort)