Re: Inconsistent update in the MERGE command

From: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To: Dmitry <dsy(dot)075(at)yandex(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Inconsistent update in the MERGE command
Date: 2025-08-24 17:34:04
Message-ID: 20250825023404.06f96a85bd806b587ca28d3d@sraoss.co.jp
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 21 Aug 2025 16:30:09 +0300
Dmitry <dsy(dot)075(at)yandex(dot)ru> wrote:

> Hi hackers,
>
> I noticed an inconsistent update when executing MERGE commands, which
> looks more like a bug.
> In my test example, the value of 'val' should increase in an ascending
> monotonous sequence.
>
> Test system
> ===========
> - Architecture: x86_64
> - OS: Ubuntu 24.04.3 LTS (Noble Numbat)
> - Tested postgres version(s):
>      - latest 17 (17.6)
>
> Steps to reproduce
> ==================
>   postgres=# create table t_merge (id int primary key, val int);
>   CREATE TABLE
>   postgres=# create table t_merge_chk (val int primary key);
>   CREATE TABLE
>   postgres=# insert into t_merge values (1,0);
>   INSERT 0 1
>
>   pgbench --no-vacuum --exit-on-abort -c 10 --file=/dev/stdin <<'EOF'
>   begin;
>   merge into t_merge t
>   using (select 1 id) s on (t.id = s.id)
>   when matched then update set val = t.val + 1
>   returning val \gset
>
>   -- Checking the uniqueness of a value
>   insert into t_merge_chk (val) values (:val);
>   commit;
>   EOF
>
>   pgbench: error: client 3 script 0 aborted in command 2 query 0:
> ERROR:  duplicate key value violates unique constraint "t_merge_chk_pkey"
>   DETAIL:  Key (val)=(2) already exists.
>
>
> What do you think about this?

I confirmed this issue by executing the following query concurrently
in three transactions. (With only two transactions, the issue does not occur.)

merge into t_merge t
using (select 1 id) s on (t.id = s.id)
when matched then update set val = t.val + 1
returning old.ctid,old.val,new.ctid,new.val;

Theoretically, each transaction should increment the val value
independently, so the total result should be an increment of three.
However, sometimes (not always, but often) the second and third committed
transactions end up producing the same result.

For example, when the first committed transaction returns:

ctid | val | ctid | val
---------+-----+---------+-----
(2,158) | 4 | (2,159) | 5
(1 row)

the second committed transaction could return:

ctid | val | ctid | val
---------+-----+---------+-----
(2,159) | 5 | (2,160) | 6
(1 row)

and the third committed transaction could return:

ctid | val | ctid | val
---------+-----+---------+-----
(2,159) | 5 | (2,161) | 6
(1 row)

This is wrong, since the old.val and new.val should be 6 and 7, respectively.

I don't completely understand how this race condition occurs,
but I believe the bug is due to the misuse of TM_FailureData
returned by table_tuple_lock in ExecMergeMatched().

Currently, TM_FailureData.ctid is used as a reference to the
latest version of oldtuple, but this is not always correct.
Instead, the tupleid passed to table_tuple_lock should be used.

I've attached a patch to fix this.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>

Attachment Content-Type Size
0001-Fix-misuse-of-TM_FailureData.ctid-in-ExecMergeMatche.patch text/x-diff 1.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2025-08-24 18:11:36 Re: Non-reproducible AIO failure
Previous Message Mihail Nikalayeu 2025-08-24 16:52:00 Re: Adding REPACK [concurrently]