From: Ashwin Agrawal <ashwinstar(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Backends stalled in 'startup' state
Date: 2023-01-18 01:21:27
Message-ID: CAKSySwc5OdOWuvWCu+-njdDNjwW=U7WHbxri+5ft5mVRrhB8nA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 17, 2023 at 4:52 PM Ashwin Agrawal <ashwinstar(at)gmail(dot)com> wrote:

>
> We recently saw many backends (close to the max_connections limit) stalled
> in the 'startup' state in one of the production environments for Greenplum
> (a fork of PostgreSQL). Tracing the cause, we found that all the tuples
> created by bootstrap (xmin=1) in pg_attribute were at very high block
> numbers (for example, beyond block 30,000). The reason for the backend
> startup stall exactly matched Tom's analysis in [1]. Stalls became much
> longer in the presence of sub-transaction overflow or long-running
> transactions, as tuple visibility checks took longer. The earlier thread
> had ruled out the possibility of the system catalog rows being present at
> higher block numbers instead of at the front of pg_attribute.
>
> This email provides a simple reproduction on the latest version of
> PostgreSQL and a root-cause analysis of how bootstrap catalog entries can
> move to higher blocks and, as a result, cause stalls during backend start.
> A simple fix to avoid the issue is provided at the end.
>
> The cause is syncscan triggering during VACUUM FULL. VACUUM FULL rewrites
> the table by performing a seqscan as well, and
> heapam_relation_copy_for_cluster() allows that scan to use syncscan. Hence
> the scan may start not from block 0 but from some other block already in
> cache, which opens the possibility of moving the bootstrap tuples anywhere
> else in the table.
>
> ------------------------------------------------------------------
> Repro
> ------------------------------------------------------------------
> -- create database to play
> drop database if exists test;
> create database test;
> \c test
>
> -- function just to create many tables to increase pg_attribute size
> -- (a table with many columns might do the job more easily)
> CREATE OR REPLACE FUNCTION public.f(id integer)
> RETURNS void
> LANGUAGE plpgsql
> STRICT
> AS $function$
> declare
> sql text;
> i int;
> begin
> for i in id..id+9999 loop
> sql='create table if not exists tbl'||i||' (id int)';
> execute sql;
> end loop;
> end;
> $function$;
>
> select f(10000);
> select f(20000);
> select f(30000);
> select f(40000);
>
> -- validate pg_attribute size is greater than 1/4 of shared_buffers
> -- for syncscan to trigger
> show shared_buffers;
> select pg_size_pretty(pg_relation_size('pg_attribute'));
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>
> -- perform a seqscan of pg_attribute so the reported scan position moves past the bootstrap tuples
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
>
> -- this will internally use syncscan, starting with a block after the
> -- bootstrap tuples, and hence move the bootstrap tuples to higher block numbers
> vacuum full pg_attribute;
>
> ------------------------------------------------------------------
> Sample run
> ------------------------------------------------------------------
> show shared_buffers;
> shared_buffers
> ----------------
> 128MB
> (1 row)
>
> select pg_size_pretty(pg_relation_size('pg_attribute'));
> pg_size_pretty
> ----------------
> 40 MB
> (1 row)
>
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
> ctid | xmin | attrelid | attname
> -------+------+----------+--------------
> (0,1) | 1 | 1255 | oid
> (0,2) | 1 | 1255 | proname
> (0,3) | 1 | 1255 | pronamespace
> (0,4) | 1 | 1255 | proowner
> (0,5) | 1 | 1255 | prolang
> (5 rows)
>
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
> COPY 2000
> vacuum full pg_attribute;
> VACUUM
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
> ctid | xmin | attrelid | attname
> -----------+------+----------+--------------
> (5115,14) | 1 | 1255 | oid
> (5115,15) | 1 | 1255 | proname
> (5115,16) | 1 | 1255 | pronamespace
> (5115,17) | 1 | 1255 | proowner
> (5115,18) | 1 | 1255 | prolang
> (5 rows)
>
>
> Note:
> -- the same logic that causes the problem can be used to fix it on an affected system :-)
> -- scan up to the block where the bootstrap tuples are located
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
> -- now, due to syncscan triggering, it will pick the blocks containing the
> -- bootstrap tuples first, which helps bring them back to the front
> vacuum full pg_attribute;
>
> ------------------------------------------------------------------
> Patch to avoid the problem:
> ------------------------------------------------------------------
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index a3414a76e8..4c031914a3 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -757,7 +757,17 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>  		pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
>  									 PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
>  
> -		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
> +		/*
> +		 * For system catalog tables avoid syncscan, so that the scan always
> +		 * starts from block 0 during rewrite and helps retain the bootstrap
> +		 * tuples in the initial pages only. If syncscan were used, the
> +		 * bootstrap tuples could move to higher blocks, which would lead to
> +		 * degraded performance for relcache initialization during
> +		 * connection starts.
> +		 */
> +		if (is_system_catalog)
> +			tableScan = table_beginscan_strat(OldHeap, SnapshotAny, 0, (ScanKey) NULL, true, false);
> +		else
> +			tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
>  		heapScan = (HeapScanDesc) tableScan;
>  		indexScan = NULL;
> ------------------------------------------------------------------
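For reference, the condition that makes this scenario possible can be sketched as follows. This is an illustrative Python model of the start-block choice (a simplification of the logic in initscan() and the syncscan machinery; parameter names follow the PostgreSQL source, but this is not the real implementation):

```python
def choose_start_block(rel_blocks, nbuffers, allow_sync,
                       last_reported_block, synchronize_seqscans=True):
    """Approximate the syncscan start-block decision: a seqscan may start
    at another scan's last reported position only if syncscan is permitted
    and the table exceeds a quarter of shared_buffers."""
    if (synchronize_seqscans and allow_sync
            and rel_blocks > nbuffers // 4):
        return last_reported_block  # join an in-progress scan mid-table
    return 0                        # plain scan from the beginning

# Numbers from the sample run: pg_attribute of 40 MB (5120 8kB blocks)
# vs shared_buffers of 128 MB (16384 blocks) -> syncscan triggers.
assert choose_start_block(5120, 16384, True, 2000) == 2000
# With the patch, is_system_catalog forces allow_sync=false -> block 0.
assert choose_start_block(5120, 16384, False, 2000) == 0
```

This also shows why the repro insists on pg_attribute being larger than a quarter of shared_buffers: below that threshold the rewrite always starts at block 0 and the bootstrap tuples stay put.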
>
>
> 1] https://www.postgresql.org/message-id/27844.1338148415%40sss.pgh.pa.us
>

I did not receive any comments/replies to the earlier email sent to
pgsql-hackers(at)lists(dot)postgresql(dot)org, hence trying via
pgsql-hackers(at)postgresql(dot)org this time (as I am not sure whether it
was missed or there was no interest).

Also, I wish to add more scenarios where the problem manifests.
During RelationCacheInitializePhase3() -> load_critical_index(), a
sequential search is performed for tuples in pg_class
for ClassOidIndexId, AttributeRelidNumIndexId, IndexRelidIndexId,
OpclassOidIndexId, AccessMethodProcedureIndexId, RewriteRelRulenameIndexId
and TriggerRelidNameIndexId. We found on some systems that the tuples
corresponding to these indexes are not always present in the starting
blocks of pg_class. In particular, the tuples for pg_opclass_oid_index,
pg_rewrite_rel_rulename_index, pg_amproc_fam_proc_index,
pg_trigger_tgrelid_tgname_index and pg_index_indexrelid_index
were often found at block numbers beyond 2000. We are not fully sure of
the reason for this - perhaps REINDEX moves them to higher block numbers.
In any situation where tuple visibility slows down (say, due to
sub-transaction overflow) and the relcache is invalidated, a lot of
backends were seen stalled in the "startup" phase.

--
*Ashwin Agrawal (VMware)*
