Parallel Query Stats

From: Benoit Lobréau <benoit(dot)lobreau(at)dalibo(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Parallel Query Stats
Date: 2023-04-05 13:00:53
Message-ID: d657df20-c4bf-63f6-e74c-cb85a81d0383@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

Parallelism has been in core since 9.6, it's a great feature that got
several
upgrades since then. However, it lacks metrics to determine if and how
parallelism is used and help tune parameters related to it.

Currently, the only information available are pg_stat_activity.backend_type
and pg_stat_activity.leader_pid. These could be sampled to get statistics
about the number of queries that are using parallel workers and the
number of
workers spawned (globally or per statement), but this is not ideal because:

* the sampling period would require a high frequency to get stats
close enough from reality without missing lots of short duration
queries;
* with sampling we cannot get an accurate count of parallel queries;
* we don't know how many queries can't get the workers they asked for.

We thought about several places where we could add some user facing
metrics, and would
like some input about the design before working on a larger patch. The
various chosen
names are obviously not settled.

# Traces

We could add a GUC "log_parallel_draught": it would add a message in the
logs when a
query or utility asks for parallel workers but can't get all of them.

The message could look like this. It could be issued several times per query
since workers can be requested for different parts of the plan.

LOG: Parallel worker draught detected: worker launched: 0, requested: 2
STATEMENT: explain analyze select * from pq_foo inner join pq_bar
using(id);

LOG: Parallel worker draught detected: worker launched: 0, requested: 1
CONTEXT: while scanning relation "public.pv_tbl"
STATEMENT: VACUUM (PARALLEL 2, VERBOSE) pv_tbl;

LOG: Parallel worker draught detected: worker launched: 0, requested: 1
STATEMENT: CREATE INDEX ON pi_tbl(i);

This could be used in tools like pgBadger to aggregate stats
on statements that didn't get their workers, but we might need additionnal
information to know why we are lacking workers.

We have a working PoC patch for this since it seems the most
straightforward to implement and use.

# pg_stat_bgworker view

I was initially thinking about metrics like:
* number of parallel queries
* number of parallel queries that didn't get their workers
But without a number of eligible queries, it's not very useful.

Instead, some metrics could be useful:
* how many workers were requested
* how many workers were obtained.
The data would be updated as the workers are spawned
(or aren't). It would be interesting to have this information per
background worker type in order to identify which pool is the source of a
parallel worker draught.

The view could look like this:

* bgworker_type: possible values would be: logical replication worker /
parallel
worker / parallel maintenance worker / a name given by an extension;
* datname: the database where the workers were connected if applicable,
or null
otherwise;
* active: number of currently running workers;
* requested: number of requested workers ;
* obtained: number of obtained workers ;
* duration: the aggregation of all durations; we could update this field
when a
background worker finishes and add the duration from the one still
running to
produce an more accurate number;
* stats_reset: the reset would be handled the same way other pg_stat* views
handle it.

The parallel maintenance worker type doesn't exist in pg_stat_activity.
I think
it would be worthwhile to add it since this kind of parallel worker has it's
own pool.

This view could be used by monitoring or metrology tools to raise alerts or
trace graphs of the background worker usage, and determine if, when and
where
there is a shortage of workers.

Tools like pg_activity, check_postgres/check_pgactivity or prometheus
exporters could use these stats.

# pg_stat_statements

This view is dedicated to per-query statistics. We could add a few metrics
related to parallelism:

* parallelized_calls: how many executions were planned with parallelism;
* parallelized_draught_calls: how many executions were planned with
parallelism but
didn't get all their workers;
* parallel_workers_requested: how many workers were requested for this
parallel
statement;
* parallel_workers_total: how many workers were obtained for this
parallel statement;

The information is useful to detect queries that didn't get their
workers on a
regular basis. If it's sampled we could know when. It could be used by tools
like POWA to eg. visualize the query runtime depending on the number of
workers, the moment of the day it lacks the requested workers, etc.

The two last could help estimate if a query makes a heavy use of
parallelism.

Note: I have skimmed throught the thread "Expose Parallelism counters
planned/execute
in pg_stat_statements" [1] and still need to take a closer look at it.

[1]
https://www.postgresql.org/message-id/flat/6acbe570-068e-bd8e-95d5-00c737b865e8%40gmail.com

# pg_stat_all_tables and pg_stat_all_indexes

We could add a parallel_seq_scan counter to pg_stat_all_tables. The column
would be incremented for each worker participating in a scan. The leader
would also increment the counter if it is participating.

The same thing could be done to pg_stat_all_indexes with a
parallel_index_scan
column.

These metrics could be used in relation to system stats and other PostgreSQL
metrics such as pg_statio_* in tools like POWA.

# Workflow

An overview of the backgroud worker usage could be viewed via the
pg_stat_bgworker view. It could help detect, and in some cases explain,
parallel
workers draughts. It would also help adapt the size of the worker pools and
prompt us to look into the logs or pg_stat_statements.

The statistics gathered in pg_stat_statements can be used the usual way:
* have an idea of the parallel query usage on the server;
* detect queries that starve from lack of parallel workers;
* compare snapshots to see the impact of parameter modifications;
* combine the statistics with other sources to know:
* if the decrease in parallel workers had on impact on the average
execution duration
* if the increase in parallel workers allocation had an impact on the
system
time;

The logs can be used to pin point specific queries with their parameters or
to get global statistics when pg_stat_statements is not available or
can't be
used.

Once a query is singled out, it can be analysed as usual with EXPLAIN to
determine:
* if the lack of workers is a problem;
* how parallelism helps in this particular case.

Finally, the per relation statitics could be combined with system and other
PostgreSQL metrics to identify why the storage is stressed.

If you reach this point, thank you for reading me!

Many thanks to Melanie Plageman for the pointers she shared with us
around the
pgsessions in Paris and her time in general.

--
Benoit Lobréau
Consultant
http://dalibo.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2023-04-05 13:10:55 Re: Should vacuum process config file reload more often
Previous Message Robert Haas 2023-04-05 12:59:12 Re: Comment typo in recent push