[PATCH v3] pg_progress() SQL function to monitor progression of long running SQL queries/utilities

From: Remi Colinet <remi(dot)colinet(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH v3] pg_progress() SQL function to monitor progression of long running SQL queries/utilities
Date: 2017-06-21 14:01:14
Message-ID: CADdR5nxQUSh5kCm9MKmNga8+c1JLxLHDzLhAyXpfo9-Wmc6s5g@mail.gmail.com

Hello,

This is version 3 of a patch adding a new SQL function to monitor long
running SQL queries/utilities in Postgres backends.
Previous versions used a command named PROGRESS, which restricted its use
to psql only.

Rationale for the new SQL function
=========================

The purpose of the new SQL function is to monitor the progression of long
running SQL queries/utilities in backend processes. It can be used to
evaluate when a query will complete and consequently whether it will
complete within a reasonable amount of time.

The function is named pg_progress(pid, verbosity).

Arguments are:

- pid: pid of a given backend process. This pid can be found using
pg_stat_activity. The value can also be 0, in which case the progress of
all backend processes is requested, one by one. Only regular backends can
be monitored; auxiliary processes such as the WAL writer are not covered
by the new function.

- verbosity: this value ranges from 0 to 3. Higher values provide more
details. It also controls whether disk space usage is reported.

The function returns a set of rows similar to name/value pairs reflecting
the execution tree and its progression for long running nodes. Long running
nodes are for instance SeqScan, IndexScan, Sort using tapes, TupleStore (on
disk), Limit and HashJoin.

The row fields returned by the SQL function are:

- pid: process pid of the backend being monitored

- ppid: parent pid, or master process pid for parallel queries

- bid: backend Id

- lineid: line number within the report of a given backend

- indent: indentation of the node. This field and lineid are used to order
and indent the output, in order to reflect the execution tree.

- type: 3 row types exist:
- nodes, which are for instance SeqScan, HashJoin, Sort
- relationships between nodes and child nodes of the execution tree
- properties exhibiting the execution progress of long running nodes

- name: the name of the node/relationship/property. This is the name in the
name/value pairs.

- value: the value of the node.

- unit: the unit for the above value field.

The following SQL query, run in the monitoring process, can be used to
exhibit the progress of the execution tree:

test=# SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent),name),
value, unit FROM pg_progress(0,0);

The indent field can be used to reflect the tree structure of the execution
plan.

Use case
=======

A use case is shown in the example below, based on a table named t_10m
with 10 million rows.

The table has been created with :

CREATE TABLE T_10M ( id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id,
md5(random()::text) AS md5;
ANALYSE t_10m;

ANALYSE should be run so that the planner statistics are up to date.

1/ Start a first psql session to run a long SQL query:

[pgadm(at)rco ~]$ psql -A -d test
psql (10devel)
Type "help" for help.
test=#

The -A option produces unaligned output; otherwise psql might slow down
the progression of the query.
Redirect the output to a file in order to let the query run without
terminal interaction:

test=# \o file

Start a long running query:
test=# select * from t_10M order by md5;

2/ In a second psql session, list the backend pid(s) and their matching SQL
query

[pgadm(at)rco ~]$ psql -d test
psql (10devel)
Type "help" for help.

test=# SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent),name),
value, unit FROM pg_progress(0,0);
pid | ppid | bid | concat | value | unit
-------+------+-----+------------------+------------------+---------
14106 | 0 | 4 | status | query running |
14106 | 0 | 4 | relationship | progression |
14106 | 0 | 4 | node name | Sort |
14106 | 0 | 4 | sort status | on tapes writing |
14106 | 0 | 4 | completion | 0 | percent
14106 | 0 | 4 | relationship | Outer |
14106 | 0 | 4 | node name | Seq Scan |
14106 | 0 | 4 | scan on | t_10m |
14106 | 0 | 4 | fetched | 25049 | block
14106 | 0 | 4 | total | 83334 | block
14106 | 0 | 4 | completion | 30 | percent
(11 rows)

test=#

Choose the pid of the backend running the long SQL query to be monitored,
or use 0 as above to get a progress report for all backends.

The sort is done on tapes because the table has 10 million rows.

The output shows the progression in terms of:
- blocks with:
- already fetched blocks
- total blocks to be fetched
- percentage of completion

Each node may have different output types.
Progression is reported in terms of rows, blocks, bytes, and percentages.

Sort nodes show the tape blocks being used and the current step phase
(merge/sort).

Parallel queries can be monitored by observing progress of both master and
worker backends. In one psql session, run:

test=# SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent),name),
value, unit FROM pg_progress(0,0);
pid | ppid | bid | concat | value | unit
-------+-------+-----+---------------------+-----------------+---------
26924 | 0 | 4 | status | query running |
26924 | 0 | 4 | relationship | progression |
26924 | 0 | 4 | node name | Gather Merge |
26924 | 0 | 4 | relationship | Outer |
26924 | 0 | 4 | node name | Sort |
26924 | 0 | 4 | completion | 0 | percent
26924 | 0 | 4 | relationship | Outer |
26924 | 0 | 4 | node name | Seq Scan |
26924 | 0 | 4 | node mode | parallel |
26924 | 0 | 4 | scan on | t_10m |
26924 | 0 | 4 | scan mode | parallel |
26924 | 0 | 4 | fetched | 38231 | block
26924 | 0 | 4 | total | 132403 | block
26924 | 0 | 4 | completion | 28 | percent
26926 | 26924 | 6 | relationship | child worker |
26926 | 26924 | 6 | node name | Sort |
26926 | 26924 | 6 | completion | 0 | percent
26926 | 26924 | 6 | relationship | Outer |
26926 | 26924 | 6 | node name | Seq Scan |
26926 | 26924 | 6 | node mode | parallel |
26926 | 26924 | 6 | scan on | t_10m |
26926 | 26924 | 6 | scan mode | parallel |
26926 | 26924 | 6 | fetched | 38246 | block
26926 | 26924 | 6 | total | 132403 | block
26926 | 26924 | 6 | completion | 28 | percent
26925 | 26924 | 5 | relationship | child worker |
26925 | 26924 | 5 | node name | Sort |
26925 | 26924 | 5 | completion | 0 | percent
26925 | 26924 | 5 | relationship | Outer |
26925 | 26924 | 5 | node name | Seq Scan |
26925 | 26924 | 5 | node mode | parallel |
26925 | 26924 | 5 | scan on | t_10m |
26925 | 26924 | 5 | scan mode | parallel |
26925 | 26924 | 5 | fetched | 38283 | block
26925 | 26924 | 5 | total | 132403 | block
26925 | 26924 | 5 | completion | 28 | percent
(36 rows)

test=#

The above output shows a master backend whose pid is 26924, with worker
backends 26926 and 26925.
The node types are Seq Scan and Sort.

Design of the SQL function
===================

The design of the SQL function relies on a per backend shared memory
structure named ProgressCtl. This structure contains fields for progression
requests and responses.

The monitoring process fills the request fields in the per backend
ProgressCtl structure. Once the request fields are set, it sends a signal
to the monitored process and waits on a per backend latch until the
response of the monitored backend is available.

The monitored process, upon receiving the signal, temporarily stops its
execution and notes the progression request. It then continues its
execution until the interruption can be serviced. At that moment, it
disables interrupts and walks through its execution tree to collect the
progress of each node in one pass, interrupting the backend SQL query
execution only once. A ProgressState structure, stored in the monitored
backend process, is used to manage the progress report. The backend then
dumps the progress report into a per backend shared memory page and sets
the latch, allowing the monitoring backend to wake up and collect the
report from shared memory. The monitored backend then resumes the
execution of its SQL query.

The monitoring backend collects the progress report in shared memory, and
sends it to its psql session as a list of rows.

If the single shared memory page is not enough for the whole progress
report, the transfer between the two backends is done as a series of
request/response exchanges. Before setting the latch, the monitored
backend writes the size of the data dumped in shared memory and sets a
status flag to indicate that more data is to be sent through the shared
memory page. The monitoring backend gets the result, sends another signal,
and then waits on the latch again. The monitored backend does not collect
a new progress report but continues to dump the already collected report.
The exchange goes on until the full progress report has been dumped.

Should the monitored backend process disappear, the wait on the latch
eventually completes after a timeout whose value is a GUC variable. The
default is 5 seconds, which should be enough for most cases.

Should the monitoring process disappear, the monitored backend still
dumps its data in shared memory and continues its execution. On the next
progression request, the monitored backend starts a new progress report.

When the monitored process executes a SQL query, it collects the progress
of each long running node with counters of rows and blocks. This data
collection incurs a very low overhead on the running query.

At the time a monitored backend deals with a progress report, the SQL
query must be in a consistent state, without partially allocated or freed
nodes. This is needed to walk through the execution tree in the middle of
the SQL query execution. It is enforced by a flag added in the executor:
the progression is only collected if the backend is inside ExecutorRun(),
which is known from the execution flag. If the execution tree is in a
consistent state, it is collected and dumped into the shared memory page.
The dump of the progress report is only done upon receiving a progress
request, which is expected to be infrequent.

Parallel queries can also be monitored. The same mechanism is used to
monitor child workers.

Utilities such as CREATE INDEX, which may take long to complete, are also
monitored.

Patch
=====

The patch for version 3 is attached to the current email.
It has been rebased on the latest commits as of today.

The patch is also available at:
https://github.com/colinet/progress/tree/progress

The diff stat of the patch is:

[root(at)rco pg]# git diff --stat master..
doc/src/sgml/config.sgml | 27 +-
doc/src/sgml/logical-replication.sgml | 2 +-
doc/src/sgml/monitoring.sgml | 283 +++++++++++-
doc/src/sgml/ref/create_function.sgml | 5 +-
doc/src/sgml/ref/create_subscription.sgml | 2 +-
doc/src/sgml/ref/load.sgml | 11 +-
doc/src/sgml/ref/psql-ref.sgml | 12 +-
src/backend/catalog/index.c | 11 +
src/backend/catalog/partition.c | 2 +-
src/backend/catalog/pg_publication.c | 24 -
src/backend/commands/explain.c | 3 +-
src/backend/executor/execMain.c | 12 +-
src/backend/executor/execProcnode.c | 31 ++
src/backend/nodes/outfuncs.c | 245 ++++++++++
src/backend/optimizer/path/costsize.c | 97 ++--
src/backend/postmaster/postmaster.c | 1 +
src/backend/rewrite/rewriteDefine.c | 6 -
src/backend/storage/file/buffile.c | 73 +++
src/backend/storage/file/fd.c | 15 +
src/backend/storage/ipc/ipci.c | 4 +
src/backend/storage/ipc/procarray.c | 59 +++
src/backend/storage/ipc/procsignal.c | 5 +
src/backend/storage/lmgr/lwlock.c | 2 +-
src/backend/tcop/postgres.c | 59 ++-
src/backend/tcop/pquery.c | 7 +
src/backend/tcop/utility.c | 10 +
src/backend/utils/adt/Makefile | 2 +-
src/backend/utils/adt/pg_progress.c | 2764
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/backend/utils/adt/varlena.c | 4 +-
src/backend/utils/init/globals.c | 18 +
src/backend/utils/init/miscinit.c | 21 +-
src/backend/utils/misc/guc.c | 23 +
src/backend/utils/sort/logtape.c | 18 +
src/backend/utils/sort/tuplesort.c | 153 +++++-
src/backend/utils/sort/tuplestore.c | 75 ++-
src/bin/pg_dump/pg_backup_archiver.c | 8 +-
src/bin/pg_upgrade/check.c | 20 +-
src/bin/pg_upgrade/pg_upgrade.c | 2 +-
src/bin/pg_upgrade/pg_upgrade.h | 2 +-
src/bin/psql/describe.c | 16 +-
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_proc.h | 6 +-
src/include/commands/explain.h | 2 +
src/include/executor/execdesc.h | 3 +
src/include/miscadmin.h | 2 +
src/include/nodes/execnodes.h | 3 +
src/include/nodes/nodes.h | 7 +
src/include/nodes/plannodes.h | 11 +
src/include/pgstat.h | 3 +-
src/include/storage/buffile.h | 11 +
src/include/storage/fd.h | 2 +
src/include/storage/lwlock.h | 1 +
src/include/storage/procarray.h | 3 +
src/include/storage/procsignal.h | 3 +
src/include/tcop/pquery.h | 1 +
src/include/utils/logtape.h | 2 +
src/include/utils/pg_progress.h | 46 ++
src/include/utils/tuplesort.h | 72 ++-
src/include/utils/tuplestore.h | 35 ++
src/test/regress/expected/join.out | 35 --
src/test/regress/expected/opr_sanity.out | 10 -
src/test/regress/expected/rules.out | 5 -
src/test/regress/sql/join.sql | 28 --
src/test/regress/sql/opr_sanity.sql | 7 -
src/test/regress/sql/rules.sql | 5 -
65 files changed, 4153 insertions(+), 286 deletions(-)
[root(at)rco pg]#

Two main files are created:
src/backend/utils/adt/pg_progress.c
src/include/utils/pg_progress.h

The other files are touched by small changes in order to collect the
progression of the long running nodes.

Changelog
========

v3:
Added documentation
Added guc parameters for policy change (timeout, delay before reporting).
Restrict use of pg_progress() to superuser only
Support utility progression with CREATE INDEX.
Allow multiple monitoring backends at the same time to request progress of
the backends.
Fixed design of shared memory exchange between monitoring and monitored
backends when report is more than one shm page size.
Fixed locking of backend report request.
Moved code for new SQL function in src/backend/utils/adt/pg_progress.c
Use LWLockTranche to allocate one LWLock per backend. This lock is used to
serialize progression requests on backends.
Replace the PROGRESS command by a new SQL function returning rows instead
of TEXT, JSON or XML.
Improved debug support

v2:
Added JSON, XML, TEXT INLINE format output
Added time consumed to run SQL queries
Added SQL query being monitored
Added support for parallel queries
Added disk used by the queries
Rebased on 9th May 2017 commits.

v1:
Initial version with only TEXT format reporting

Any suggestion, comment or feedback is welcome.

Regards

Attachment Content-Type Size
progress-v3.patch text/x-patch 112.1 KB
