Backends stalled in 'startup' state

From: Ashwin Agrawal <ashwinstar(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Backends stalled in 'startup' state
Date: 2022-09-15 23:42:18
Message-ID: CAKSySwf+YH_4=UDosdyNG2+EYOFaThsATB9AfNkUjM4uq+69vQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We recently saw many backends (close to max_connection limit) get stalled
in 'startup' in one of the production environments for Greenplum (fork of
PostgreSQL). Tracing the reason, it was found all the tuples created by
bootstrap (xmin=1) in pg_attribute were at super high block numbers (for
example beyond 30,000). Tracing the reason for the backend startup stall
exactly matched Tom's reasoning in [1]. Stalls became much longer in
presence of sub-transaction overflow or presence of long running
transactions as tuple visibility took longer. The thread ruled out the
possibility of system catalog rows to be present in higher block numbers
instead of in front for pg_attribute.

This thread provides simple reproduction on the latest version of
PostgreSQL and RCA for how bootstrap catalog entries can move to higher
blocks and as a result cause stalls for backend starts. Simple fix to avoid
the issue provided at the end.

The cause is syncscan triggering during VACUUM FULL. VACUUM FULL rewrites
the table by performing the seqscan as well. And
heapam_relation_copy_for_cluster() conveys feel free to use syncscan. Hence
logic to not start from block 0 instead some other block already in cache
is possible and opens the possibility to move the bootstrap tuples to
anywhere else in the table.

------------------------------------------------------------------
Repro
------------------------------------------------------------------
-- create database to play
drop database if exists test;
create database test;
\c test

-- function just to create many tables to increase pg_attribute size
-- (ideally many column table might do the job more easily)
CREATE OR REPLACE FUNCTION public.f(id integer)
RETURNS void
LANGUAGE plpgsql
STRICT
AS $function$
declare
sql text;
i int;
begin
for i in id..id+9999 loop
sql='create table if not exists tbl'||i||' (id int)';
execute sql;
end loop;
end;
$function$;

select f(10000);
select f(20000);
select f(30000);
select f(40000);

-- validate pg_attribute size is greater than 1/4 of shared_buffers
-- for syncscan to triggger
show shared_buffers;
select pg_size_pretty(pg_relation_size('pg_attribute'));
select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit
5;

-- perform seq scan of pg_attribute to page past bootstrapped tuples
copy (select * from pg_attribute limit 2000) to '/tmp/p';

-- this will internally use syncscan starting with block after bootstrap
tuples
-- and hence move bootstrap tuples last to higher block numbers
vacuum full pg_attribute;

------------------------------------------------------------------
Sample run
------------------------------------------------------------------
show shared_buffers;
shared_buffers
----------------
128MB
(1 row)

select pg_size_pretty(pg_relation_size('pg_attribute'));
pg_size_pretty
----------------
40 MB
(1 row)

select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit
5;
ctid | xmin | attrelid | attname
-------+------+----------+--------------
(0,1) | 1 | 1255 | oid
(0,2) | 1 | 1255 | proname
(0,3) | 1 | 1255 | pronamespace
(0,4) | 1 | 1255 | proowner
(0,5) | 1 | 1255 | prolang
(5 rows)

copy (select * from pg_attribute limit 2000) to '/tmp/p';
COPY 2000
vacuum full pg_attribute;
VACUUM
select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit
5;
ctid | xmin | attrelid | attname
-----------+------+----------+--------------
(5115,14) | 1 | 1255 | oid
(5115,15) | 1 | 1255 | proname
(5115,16) | 1 | 1255 | pronamespace
(5115,17) | 1 | 1255 | proowner
(5115,18) | 1 | 1255 | prolang
(5 rows)

Note:
-- used logic causing the problem to fix it as well on the system :-)
-- scan till block where bootstrap tuples are located
select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit
5;
-- now due to syncscan triggering it will pick the blocks with bootstrap
tuples first and help to bring them back to front
vacuum full pg_attribute;

------------------------------------------------------------------
Patch to avoid the problem:
------------------------------------------------------------------
diff --git a/src/backend/access/heap/heapam_handler.c
b/src/backend/access/heap/heapam_handler.c
index a3414a76e8..4c031914a3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -757,7 +757,17 @@ heapam_relation_copy_for_cluster(Relation OldHeap,
Relation NewHeap,
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,

PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);

- tableScan = table_beginscan(OldHeap, SnapshotAny, 0,
(ScanKey) NULL);
+ /*
+ * For system catalog tables avoid syncscan, so that scan
always
+ * starts from block 0 during rewrite and helps retain
bootstrap
+ * tuples in initial pages only. If using syncscan, then
bootstrap
+ * tuples may move to higher blocks, which will lead to
degraded
+ * performance for relcache initialization during
connection starts.
+ */
+ if (is_system_catalog)
+ tableScan = table_beginscan_strat(OldHeap,
SnapshotAny, 0, (ScanKey) NULL, true, false);
+ else
+ tableScan = table_beginscan(OldHeap, SnapshotAny,
0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
------------------------------------------------------------------

1] https://www.postgresql.org/message-id/27844.1338148415%40sss.pgh.pa.us

--
*Ashwin Agrawal (VMware)*

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-09-16 00:05:46 Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Previous Message Tom Lane 2022-09-15 23:09:55 Oddities in our pg_attribute_printf usage