'tuple concurrently updated' error w/o visible catalog updates

From: Alex Kliukin <alexk(at)hintbits(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: 'tuple concurrently updated' error w/o visible catalog updates
Date: 2018-05-17 13:49:37
Message-ID: 1526564977.21434.1375558632.2AD636B2@webmail.messagingengine.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

Earlier this week we have split our Postgres 9.6.8 shards, each having two
databases, into one database per shard setup. This was done by promoting
replicas and subsequently removing unused databases.

Immediately afterwards we have discovered repeated 'tuple concurrently updated'
errors on most of those new shards.

The error is always shown at the same statement:
ERROR,XX000,"tuple concurrently updated",,,,,"SQL statement ""UPDATE
config_content SET cc_content = l_config_content WHERE cc_config_content_id =
l_ccm_content_id""

By searching the archives (i.e.
https://www.postgresql.org/messageid/flat/CAB7nPqSZCkVfibTvx9TYmHYhVtV_vOMNwOpLHnRU85qeiimUaQ%40mail(dot)gmail(dot)com#CAB7nPqSZCkVfibTvx9TYmHYhVtV_vOMNwOpLHnRU85qeiimUaQ(at)mail(dot)gmail(dot)com)
I’ve got an impression that this error manifests itself when system catalog
tuples are updated concurrently, however I see none of that in the query that
leads to an ERROR. There are no triggers on 'config_content' table, neither
there are any views referring to it.

The errors stopped when we disabled a call to the 'upsert_foo_content' function
(here and below I obfuscated real names). This is a fairly simple pl/pgsql function
that does a few selects and an upsert. The block inside that function that contains
the statement at fault is:

--------------
SELECT ccm_content_id, ccm_simple_update_received_at INTO l_ccm_content_id, l_ccm_simple_update_received_at FROM config_content_metadata
WHERE ccm_config_id = l_c_id AND ccm_sales_channel_id = l_sales_channel_id;

IF (l_ccm_content_id IS NULL) THEN
-- insert config content --
INSERT INTO config_content_metadata(ccm_config_id, ccm_sales_channel_id, ccm_update_caused_by, ccm_simple_update_eid, ccm_simple_update_received_at)
VALUES(l_c_id, l_sales_channel_id, l_rp_id, l_content_update_eid, l_content_update_received_at) RETURNING ccm_content_id INTO l_ccm_content_id;

INSERT INTO config_content(cc_config_content_id, cc_content) VALUES (l_ccm_content_id, l_config_content);

ELSIF (l_ccm_simple_update_received_at < l_content_update_received_at) THEN

UPDATE config_content_metadata
SET ccm_update_caused_by = l_rp_id, ccm_simple_update_eid = l_content_update_eid, ccm_simple_update_received_at = l_content_update_received_at, ccm_updated_at = now()
WHERE ccm_content_id = l_ccm_content_id;

-- XXX problematic statement XXX
UPDATE config_content SET cc_content = l_config_content WHERE cc_config_content_id = l_ccm_content_id;

END IF;
----------

Note that config_content references config_metdata with a foreign key, however,
the referenced column is not updated.

That 'upsert_foo_content' is called by another one, upsert_foo_content_batch,
in a loop over the elements of a JSON array, something like:

----------
CREATE OR REPLACE FUNCTION upsert_foo_content_batch(p_batch jsonb)
RETURN void LANGUAGE plpgpsql
AS $function$
DECLARE
...
BEGIN
FOR item IN SELECT * FROM jsonb_array_elements(p_batch)
LOOP
-- some unpacking of fields from json into the local variables
PERFORM upsert_foo_content(..) -- called with the unpacked variables
END LOOP;
END;
$function$
----------

'upsert_foo_content_batch' is called, in order, at the end of a long pl/pgsql
function 'upsert_foo_event_batch', which consists of a very long CTE that
extracts individual fields from a JSON argument, and then performs a number of
inserts into some tables, doing on conflict do nothing, afterwards performing
more inserts into the tables that reference the previous ones, doing on
conflict do update. However, it modifies neither 'config_content' or
'config_content_metadata' tables.

So the chain of calls is
'upsert_foo_event_batch' ->
'upsert_foo_content_batch' ->
'upsert_foo_content'.
(the last one contains the statement that leads to the "tuple concurrently updated" error).

It is possible that 'upsert_foo_content' function is called with the same data
multiple times in different processes, however, I’d expect it to either
complete successfully, or throw an error because the PK already exists (this is
running in a read committed mode, so ISTM not immune to the case where the row
in the metadata table is inserted after another session does the check, but
before the insert), but not an error mentioned at the beginning of this
message.

Are there any signs in this description that the queries might be doing
something unexpected to PostgreSQL, or that something went wrong during the
split? I am running out of options of what could cause the issue, so any
pointers or help in debugging it is appreciated (note that this is a production
database, I cannot just stop it at will).

Cheers,
Oleksii

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2018-05-17 13:55:34 Re: Postgres 11 release notes
Previous Message Magnus Hagander 2018-05-17 13:38:36 Re: Postgres 11 release notes