Re: Adding REPACK [concurrently]

From: Antonin Houska <ah(at)cybertec(dot)at>
To: Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Adding REPACK [concurrently]
Date: 2026-01-15 16:36:59
Message-ID: 35686.1768495019@localhost
Lists: pgsql-hackers

Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com> wrote:

> Also, there are some crashes of stress tests for v30 (for both single snapshot and multiple snapshot versions).
>
> ---------------------
>
> Looks like something is leaking, but not sure.
>
> https://cirrus-ci.com/task/5577209672368128?logs=test_world#L277 (multiple snapshots)
> https://cirrus-ci.com/task/6439044873191424 (without multiple snapshots)

As the test runs pgbench with --client=30 and the default value of
max_worker_processes is 8, I'm not sure this is a leak. After I increased this
parameter, I could no longer see the error.

> This one showed something goes wrong, the sum of the table is broken. It may be 0 because non-MVCC safe, but I checked the logs:
>
> 2026-01-12 18:41:11.656 UTC client backend[76247] 007_repack_concurrently.pl LOG: statement: SELECT (490588) / 0;

I agree that this is due to the missing MVCC safety feature. I commented out
that check in the script for now.

Besides that, I saw some deadlocks. I think this was due to the fact that
multiple rows are updated per transaction, and that the keys are random, so it
can happen that two transactions try to update the same rows in different
order. I increased the number of rows in the test table to 10000 and don't see
the deadlocks anymore.
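
To illustrate the lock-ordering argument above, here is a minimal sketch. It is
not taken from the test script: the function name and the key lists are
hypothetical, and it assumes each transaction locks every key at most once and
holds its row locks until commit. Two such transactions can deadlock only if
they share at least two rows and lock them in opposite relative order.

```python
from itertools import combinations


def can_deadlock(order_a, order_b):
    """Return True if two transactions locking rows in these orders can deadlock.

    order_a and order_b list the keys each transaction updates, in the order
    the row locks are taken (hypothetical inputs, keys distinct per list).
    """
    position_in_b = {key: idx for idx, key in enumerate(order_b)}
    # Keys touched by both transactions, kept in transaction A's order.
    shared = [key for key in order_a if key in position_in_b]
    # A locks `first` before `second`; a cycle is possible if B does the opposite.
    for first, second in combinations(shared, 2):
        if position_in_b[first] > position_in_b[second]:
            return True
    return False


print(can_deadlock([942, 6118], [6118, 942]))  # opposite order on shared rows
print(can_deadlock([942, 6118], [942, 6118]))  # same order on shared rows
print(can_deadlock([942, 6118], [6118, 17]))   # only one shared row
```

With random keys, a larger table makes it less likely that two transactions
share two rows at all, which matches the observation that the deadlocks went
away with more rows. Updating the keys in sorted order would be another way to
rule out the cycle.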

> backend[54349] 007_repack_concurrently.pl ERROR: could not create unique index "tbl_pkey_repacknew"
> 2026-01-12 18:41:12.477 UTC client backend[54349] 007_repack_concurrently.pl DETAIL: Key (i)=(942) is duplicated.
> 2026-01-12 18:41:12.477 UTC client backend[54349] 007_repack_concurrently.pl STATEMENT: REPACK (CONCURRENTLY) tbl;

This is tricky. I could reproduce the problem on my FreeBSD box a few times,
never on Linux (no idea if the OS makes the difference since HW is also quite
different, but CI also seemed to fail more often on FreeBSD.)

Something seems to be wrong with UPDATE, but I'm failing to understand how it
could relate to REPACK. This is an example of a duplicate value, i=6118:

SELECT i, j, xmin, xmax, ctid FROM tbl WHERE i=6118;
  i   |   j    |  xmin  |  xmax  |  ctid
------+--------+--------+--------+---------
 6118 | 445435 | 102317 | 103702 | (1,216)
 6118 | 391135 | 103702 |      0 | (56,62)

According to the log, xid=102317 is the transaction used by REPACK and
xid=103702 is one of the test transactions. pageinspect shows that the old
version has not only HEAP_XMIN_COMMITTED in t_infomask, but also
HEAP_XMAX_INVALID.
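
For readers who want to decode a t_infomask value by hand, a small sketch
follows. The four bit values are the ones defined in PostgreSQL's
src/include/access/htup_details.h; the helper name and the example value are
hypothetical and are not the value observed in the failing run. With
HEAP_XMAX_INVALID set, the stored xmax is treated as invalid, so the old
version is considered live even though its xmax field is non-zero.

```python
# Transaction-status hint bits from src/include/access/htup_details.h.
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200
HEAP_XMAX_COMMITTED = 0x0400
HEAP_XMAX_INVALID = 0x0800

XACT_STATUS_BITS = {
    "HEAP_XMIN_COMMITTED": HEAP_XMIN_COMMITTED,
    "HEAP_XMIN_INVALID": HEAP_XMIN_INVALID,
    "HEAP_XMAX_COMMITTED": HEAP_XMAX_COMMITTED,
    "HEAP_XMAX_INVALID": HEAP_XMAX_INVALID,
}


def xact_status_flags(t_infomask):
    """Return the names of the transaction-status hint bits set in t_infomask."""
    return [name for name, bit in XACT_STATUS_BITS.items() if t_infomask & bit]


# Hypothetical value for illustration only: both bits mentioned above are set.
example_infomask = HEAP_XMIN_COMMITTED | HEAP_XMAX_INVALID
print(xact_status_flags(example_infomask))
```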

So far I could not reproduce the duplicates with the REPACK (CONCURRENTLY)
command commented out in the test script, but that does not prove much (even
with REPACK, not every run fails). I also noticed that REPACK incorrectly sets
cmin/cmax to 1 instead of 0, which needs to be fixed, but I have no idea why
that bug would cause exactly this weird behavior.

I even added quite a few logging messages to reveal where in the code the
HEAP_XMAX_INVALID flag is set for a particular ctid, but after a failure I
could not find the message for the problematic tuples. Ideas are appreciated.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
