[PATCH v2] Progress command to monitor progression of long running SQL queries

From: Remi Colinet <remi(dot)colinet(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PATCH v2] Progress command to monitor progression of long running SQL queries
Date: 2017-05-10 16:40:31
Message-ID: CADdR5ny_0dFwnD+suBnV1Vz6NDKbFHeWoV1EDv9buhDCtc3aAA@mail.gmail.com

Hello,

This is version 2 of the new PROGRESS command, which I wrote in order
to monitor long running SQL queries in a Postgres backend process.

New command justification
====================

The purpose of the command is to monitor long running SQL queries in a
backend process and to evaluate when they will complete.

First, the user needs to find the pid of the backend process which is
executing the long running SQL query whose progression is to be
monitored.
This search may be based on the SQL query text, the session, or the
terminal, using the view pg_stat_activity. This view provides the best
method to identify the query to be monitored.
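For instance, the pid can usually be located with a query such as the
following (a sketch; pid, state, query and query_start are standard
pg_stat_activity columns, but the filter is only a suggestion):

```sql
SELECT pid, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;
```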

Once the pid of the backend is determined, user may run:
psql> PROGRESS [VERBOSE|(FORMAT TEXT|JSON|XML|INLINE)] PID_OF_BACKEND

The output shows a table similar to that of the EXPLAIN command, but
with further details about the progression of execution of each node
which may take a long time to execute.

Such nodes are SeqScan (sequential scans), IndexScan (index scans),
Sort (long sorts which require the on-disk tape mechanism), TupleStore
(on-disk storage of tuple tables), Limit, and HashJoinTable, which uses
disk files to perform hash joins.
Other nodes build on the previous ones. For instance, Material nodes
use TupleStore nodes.

The PROGRESS command can be used with the psql \watch command. For
instance, user may run:

psql> \watch PROGRESS PID_OF_BACKEND

This is handy to view progression of SQL queries.

Use case
=======

A use case is shown in the example below, based on a table named t_10m
with at least 10 million rows.

The table has been created with:

CREATE TABLE T_10M ( id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id,
md5(random()::text) AS md5;
ANALYSE t_10m;

ANALYSE is needed to keep statistics up to date. These are used to
compare the rows fetched so far with the total number of rows, and to
report the percentage of execution of each query node.
If ANALYSE is not run against the relations used by the monitored
query, the statistics may be inaccurate.
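The totals that the percentages are computed against come from the
planner statistics in pg_class (a sketch; reltuples and relpages are
standard pg_class columns, and stay estimates until ANALYSE has run):

```sql
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 't_10m';
```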

1/ Start a first psql session to run a long SQL query:

[pgadm(at)rco ~]$ psql -A -d test
psql (10devel)
Type "help" for help.
test=#

The option -A is used to output rows without formatting work.
Redirect the output to a file in order to let the query run without
terminal interaction:

test=# \o out

Start a long running query:
test=# select * from t_10M order by md5;

2/ In a second psql session, list the backend pid(s) and their matching
SQL queries:

[pgadm(at)rco ~]$ psql -d test
psql (10devel)
Type "help" for help.

test=# select * from prg;
pid | query
-------+-----------------------------------
26306 |
26309 |
26311 | select * from t_10m order by md5;
26312 | select * from t_10m order by md5;
26313 | select * from t_10m order by md5;
26330 | select * from prg;
26304 |
26303 |
26305 |
(9 rows)

test=#

Choose the pid of the backend running the long SQL query to be
monitored. The example above is a parallel SQL query; the lowest pid is
that of the main backend of the query.

test=# PROGRESS 26311;

PLAN PROGRESS

--------------------------------------------------------------------------------------------
status: <query running>
query: select * from t_10m order by md5;
time used (s): 20
Gather Merge
  worker child: 26312
  -> Sort => rows r/w merge 0/0 sort 0/1895721 0% tape blocks 10903
     Sort Key: md5
     -> Parallel Seq Scan on t_10m => rows 1903441/4166657 45% => blks 49924/83334 59%
  worker child: 26313
  -> Sort => rows r/w merge 0/0 sort 0/1967521 0% tape blocks 11314
     Sort Key: md5
     -> Parallel Seq Scan on t_10m => rows 1975561/4166657 47% => blks 49927/83334 59%
  worker parent: 26311
  -> Sort on tapes writing => rows r/w merge 0/0 sort 0/2075590 0% tape blocks 11935
     Sort Key: md5
     -> Parallel Seq Scan on t_10m => rows 2111550/4166657 50% => blks 49928/83334 59%
total disk space used (MB) 266
(17 rows)

test=#

The query of the monitored backend is:
test=# select * from t_10M order by md5;

Because the table has 10 million rows, the sort is done on tapes.

The output shows the progression in terms of:
- rows, with:
  - rows already fetched,
  - total rows to be fetched,
- blocks, with:
  - blocks already fetched,
  - total blocks to be fetched.

Each node may have a different output type.
Progression is reported in terms of rows, blocks, bytes, and
percentages.

Sort nodes show the tape blocks being used and the current step phase
(merge/sort).

Implementation of the command
========================

The design of the command is:

- the user issues the "PROGRESS pid" command from a psql session.
The pid is that of the backend which runs the SQL query for which a
progression report is wanted. It can be determined from the view
pg_stat_activity.

- upon receiving the "PROGRESS pid" command from the psql utility, the
monitoring backend sends a signal to the backend whose process pid is
the one provided in the PROGRESS command, with the reason
PROCSIG_PROGRESS defined in src/include/storage/procsignal.h.

- the monitored backend receives the signal together with its reason (a
progression request) and notes the request, as for any interrupt. It
then continues executing its SQL query until interrupts can be
serviced.

- when the monitored process can service interrupts, it deals with the
progress request by collecting its execution tree together with the
execution progress of each long running node. At this time, the SQL
query must be in a consistent state, with no partial nodes under
allocation or freeing. This is enforced by flags added in the executor;
the progression is only collected if the backend is in ExecutorRun().
If the execution tree is in a consistent state, it is dumped into
shared memory pages allocated at server start. The monitored backend
then sets a latch on which the monitoring backend is waiting, and
continues executing its SQL query.

- the monitoring backend collects the data dumped in shared memory and
sends it to its psql session as a list of rows.

The PROGRESS command does not slow down the running query, because the
execution progression is only computed upon receiving a progress
request, which is expected to be infrequent.

The code heavily reuses that of the EXPLAIN command. In order to share
as much code as possible with the EXPLAIN command, the part of the
EXPLAIN code which deals with reporting (quals, for instance) has been
moved to a new report.c file in the src/backend/commands folder. The
code in report.c is shared between the explain.c source code and the
PROGRESS command source code, which is in the progress.c file.

The progression reported by the PROGRESS command is given in terms of
rows, blocks, bytes, and percentages. The values displayed depend on
the node type in the execution plan. The current patch implements all
the node types which could take a long time.

Parallel queries can also be monitored. The same mechanism is used to
monitor child workers, with a slight difference: the main worker
requests the child progression directly, in order to dump the whole
progress tree in shared memory.

Patch
=====

The patch for version 2 is attached to the current email.
The patch has been rebased on latest commits.

The patch is also available at:
https://github.com/colinet/progress/tree/progress

The diff stat of the patch is:

[root(at)rco pg]# git diff --stat master..
contrib/auto_explain/auto_explain.c | 5 +-
contrib/postgres_fdw/postgres_fdw.c | 13 +-
src/backend/access/heap/heapam.c | 2 +
src/backend/access/transam/parallel.c | 2 +-
src/backend/commands/Makefile | 3 +-
src/backend/commands/explain.c | 2834 ++++++++++++++++++++++++++++++++++++-----------------------------------------------------------------------------------------------------------------------
src/backend/commands/prepare.c | 5 +-
src/backend/commands/progress.c | 2035 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/backend/commands/report.c | 2252 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/backend/executor/execMain.c | 8 +
src/backend/executor/execProcnode.c | 31 ++
src/backend/executor/nodeBitmapHeapscan.c | 13 +-
src/backend/executor/nodeIndexonlyscan.c | 13 +-
src/backend/executor/nodeIndexscan.c | 15 +-
src/backend/executor/nodeSamplescan.c | 12 +-
src/backend/executor/nodeSeqscan.c | 16 +-
src/backend/nodes/bitmapset.c | 19 +
src/backend/nodes/outfuncs.c | 245 ++++++++++
src/backend/parser/gram.y | 79 +++-
src/backend/postmaster/postmaster.c | 1 +
src/backend/storage/file/buffile.c | 73 +++
src/backend/storage/file/fd.c | 15 +
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/ipc/procarray.c | 57 +++
src/backend/storage/ipc/procsignal.c | 4 +
src/backend/storage/lmgr/lwlock.c | 7 +-
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/tcop/postgres.c | 65 ++-
src/backend/tcop/pquery.c | 25 +
src/backend/tcop/utility.c | 10 +
src/backend/utils/init/globals.c | 12 +
src/backend/utils/sort/logtape.c | 18 +
src/backend/utils/sort/tuplesort.c | 153 +++++-
src/backend/utils/sort/tuplestore.c | 75 ++-
src/include/access/parallel.h | 1 +
src/include/commands/explain.h | 67 +--
src/include/commands/prepare.h | 2 +-
src/include/commands/report.h | 159 +++++++
src/include/executor/execdesc.h | 3 +
src/include/executor/progress.h | 52 ++
src/include/foreign/fdwapi.h | 10 +-
src/include/miscadmin.h | 2 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 3 +
src/include/nodes/extensible.h | 6 +-
src/include/nodes/nodes.h | 8 +
src/include/nodes/parsenodes.h | 11 +
src/include/nodes/plannodes.h | 11 +
src/include/parser/kwlist.h | 2 +
src/include/pgstat.h | 3 +-
src/include/storage/buffile.h | 11 +
src/include/storage/fd.h | 2 +
src/include/storage/procarray.h | 3 +
src/include/storage/procsignal.h | 3 +
src/include/tcop/pquery.h | 1 +
src/include/utils/logtape.h | 2 +
src/include/utils/tuplesort.h | 72 ++-
src/include/utils/tuplestore.h | 35 ++
58 files changed, 5972 insertions(+), 2619 deletions(-)
[root(at)rco pg]#

The PROGRESS command can be used with the \watch command of psql,
making it handier to monitor a long running query.
The default format of the PROGRESS command is text inline. It can be
set to json, xml, or text, as for the EXPLAIN command.
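Following the syntax given earlier, a non-default format would be
requested like this (a sketch using the patch's documented option list;
26311 is the example pid from the session above):

```sql
PROGRESS (FORMAT JSON) 26311;
```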

Use cases
========

Some further examples of use are shown in the attached test_v2.txt
file, with parallel queries, sorts, hashes, scans, and json format
output.

TODO
=====

Monitor the progress of utility commands such as CREATE INDEX, which
may take a long time to complete.
Write documentation.

Changelog
========

v2:
Added JSON, XML, TEXT INLINE format output
Added time consumed to run SQL queries
Added SQL query being monitored
Added support for parallel queries
Added disk used by the queries
Rebased on 9th May 2017 commits.

v1:
Initial version with only TEXT format reporting

Any suggestion, comment or feedback is welcome.

Regards
Remi

Attachment Content-Type Size
test_v2.txt text/plain 16.4 KB
progress_v2.patch text/x-patch 301.3 KB
