Re: RFC: Logging plan of the running query

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Étienne BERSAC <etienne(dot)bersac(at)dalibo(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, jtc331(at)gmail(dot)com, rafaelthca(at)gmail(dot)com
Subject: Re: RFC: Logging plan of the running query
Date: 2024-02-06 10:51:58
Message-ID: CAExHW5vEi_8RJC2X2mp+RSAvTXhh4NLuNd+7tQGb7P7jxgdKSg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Atsushi,

On Mon, Jan 29, 2024 at 6:32 PM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
>
> Hi,
>
> Updated the patch to fix typos and move
> ProcessLogQueryPlanInterruptActive from errfinish() to AbortTransaction.
>
>
> BTW since the thread is getting long, I list the some points of the
> discussion so far:
>
> # Safety concern
> ## Catalog access inside CFI
> - it seems safe if the CFI call is inside an existing valid
> transaction/query state[1]
>
> - We did some tests, for example calling ProcessLogQueryPlanInterrupt()
> in every single CHECK_FOR_INTERRUPTS()[2]. This test passed on my env
> but was stucked on James's env, so I modified to exit
> ProcessLogQueryPlanInterrupt() when target process is inside of lock
> acquisition code[3]
>
> ## Risk of calling EXPLAIN code in CFI
> - EXPLAIN is not a simple logic code, and there exists risk calling it
> from CFI. For example, if there is a bug, we may find ourselves in a
> situation where we can't cancel the query
>
> - it's a trade-off that's worth making for the introspection benefits
> this patch would provide?[4]
>
> # Design
> - Although some suggested it should be in auto_explain, current patch
> introduces this feature to the core[5]
>
> - When the target query is nested, only the most inner query's plan is
> explained. In future, all the nested queries' plans are expected to
> explained optionally like auto_explain.log_nested_statements[6]
>
> - When the target process is a parallel worker, the plan is not shown[6]
>
> - When the target query is nested and its subtransaction is aborted,
> pg_log_query_plan cannot log the parental query plan after the abort
> even parental query is running[7]
>
> - The output corresponds to EXPLAIN with VERBOSE, COST, SETTINGS and
> FORMAT text. It doesn't do ANALYZE or show the progress of the query
> execution. Future work proposed by Rafael Thofehrn Castro may realize
> this[8]
>
> - To prevent assertion error, this patch ensures no page lock is held by
> checking all the LocalLock entries before running explain code, but
> there is a discussion that ginInsertCleanup() should be modified[9]
>
>
> It may be not so difficult to improve some of restrictions in "Design",
> but I'd like to limit the scope of the 1st patch to make it simpler.

Thanks for the summary. It is helpful. I think patch is also getting better.

I have a few questions and suggestions
1. Prologue of GetLockMethodLocalHash() mentions
* NOTE: When there are many entries in LockMethodLocalHash, calling this
* function and looking into all of them can lead to performance problems.
*/
How bad this performance could be. Let's assume that a query is taking
time and pg_log_query_plan() is invoked to examine the plan of this
query. Is it possible that the looping over all the locks itself takes
a lot of time delaying the query execution further?

2. What happens if auto_explain is enabled in the backend and
pg_log_query_plan() is called on the same backend? Will they conflict?
I think we should add a test for the same.

Using injection point support we should be able to add tests for
testing pg_log_query_plan behaviour when there are page locks held or
when auto_explain (with instrumentation) and pg_log_query_plan() work
on the same query plan. Use injection point to make the backend
running query wait at a suitable point to delay its execution and fire
pg_log_query_plan() from other backend. May be the same test could
examine the server log file to see if the plan is indeed output to the
server log file.

Given that the feature will be used when the things have already gone
wrong, it should not make things more serious. So more testing and
especially automated would help.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-02-06 10:53:12 Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Previous Message Michael Paquier 2024-02-06 10:50:39 Re: Printing backtrace of postgres processes