Re: Adding REPACK [concurrently]

From: Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Adding REPACK [concurrently]
Date: 2026-03-31 17:37:17
Message-ID: CAFC+b6of6_poBQ6EgK8N49VQwAUYX=uLkHiU-TP7y+DamBD5TQ@mail.gmail.com
Lists: pgsql-hackers

Hi Alvaro,

On Thu, Mar 26, 2026 at 1:42 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
wrote:

>
> As for lock upgrade, I wonder if the best way to handle this isn't to
> hack the deadlock detector so that it causes any *other* process to die,
> if they detect that they would block on REPACK. Arguably there's
> nothing that you can do to a table while it's undergoing REPACK
> CONCURRENTLY; any alterations would have to wait until the repacking is
> completed. We can implement that idea simply enough, as shown in this
> crude prototype.
>

After testing this, I observed that it solves the scenario where a query
is waiting on REPACK. For example, if a DROP TABLE requests an
AccessExclusiveLock (AEL) and queues behind REPACK's
ShareUpdateExclusiveLock, the deadlock detector fires when REPACK tries
to upgrade to AEL and kills the DROP, preventing the circular queue
deadlock. But the case I originally mentioned [1] was the reverse: what
happens if a transaction already holds a lock that conflicts with the
upcoming AEL upgrade (e.g., an analytical SELECT, or an
idle-in-transaction session holding an AccessShareLock), but isn't
waiting on REPACK at all?

In this case there is no circular wait, so the deadlock detector never
fires. REPACK simply queues behind the SELECT, eventually hits its
lock_timeout, aborts, and cleans up. Initially, I thought this cleanup
was expected behavior; but after seeing your solution to protect REPACK
from losing its transient-table work, I now think it is not. If the goal
is to prevent REPACK's work from being wasted, should we error out the
backend that is making REPACK wait during the final swap phase? I am
thinking of something conceptually similar to
ResolveRecoveryConflictWithLock: actively cancelling the conflicting
session so that the AEL upgrade can proceed. Thoughts?
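To sketch the idea from SQL (the real fix would of course live in the
lock-wait path, the way ResolveRecoveryConflictWithLock does): once the
REPACK backend is waiting on the AEL upgrade, its blockers can be found
and cancelled with existing functions. The pid 12345 below is only a
placeholder for the REPACK backend's pid, not anything from the patch.

```sql
-- Hypothetical manual analogue of the proposed behavior: cancel every
-- session blocking the waiting REPACK backend (pid 12345 is assumed),
-- so the AccessExclusiveLock upgrade can proceed.
SELECT blocker, pg_cancel_backend(blocker)
FROM unnest(pg_blocking_pids(12345)) AS blocker;
```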

test scenario:

session 1:
postgres=# repack (concurrently) stress_victim;
Run with lock_timeout = 5s, and a breakpoint in
rebuild_relation_finish_concurrent at
LockRelationOid(old_table_oid, AccessExclusiveLock), i.e. just before
the exclusive lock is taken.

session 2:
postgres=# BEGIN;
BEGIN
postgres=*# SELECT * FROM stress_victim LIMIT 1;  -- transaction left open
 id  | balance |                     payload
-----+---------+----------------------------------------------------------
 170 |      65 | d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29
(1 row)
-- this takes a conflicting lock (AccessShareLock) on the same table that
-- REPACK (CONCURRENTLY) is running on.

session 1:
Release the breakpoint; the backend now waits for the conflicting lock
to be released. If lock_timeout expires in the meantime, the transaction
aborts:
postgres=# repack (concurrently) stress_victim;
ERROR: canceling statement due to lock timeout
CONTEXT: waiting for AccessExclusiveLock on relation 16637 of database 5
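While session 1 sits in the lock queue (before lock_timeout fires), the
conflict is visible from a third session. A query along these lines
shows the idle transaction's granted AccessShareLock next to REPACK's
ungranted AccessExclusiveLock request:

```sql
-- From a third session, while REPACK is waiting: list all locks on the
-- table; the SELECT's lock shows granted = t, REPACK's AEL granted = f.
SELECT pid, mode, granted
FROM pg_locks
WHERE relation = 'stress_victim'::regclass
ORDER BY granted DESC;
```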

[1] -
https://www.postgresql.org/message-id/CAFC%2Bb6pK9ogeSpMA8hg18XhC1eNPcsKWBwoC5OySXi4iTxwtRw%40mail.gmail.com

--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
