Re: RFC: Logging plan of the running query

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Étienne BERSAC <etienne(dot)bersac(at)dalibo(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, jtc331(at)gmail(dot)com, rafaelthca(at)gmail(dot)com
Subject: Re: RFC: Logging plan of the running query
Date: 2024-02-07 04:58:18
Message-ID: CAExHW5ukAdmnq2B8gVJ4ufWFUrQvFYtoMf_fGgpsDkTtEF4oZw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 7, 2024 at 9:38 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
>
> Hi Ashutosh,
>
> On 2024-02-06 19:51, Ashutosh Bapat wrote:
>
> > Thanks for the summary. It is helpful. I think the patch is also
> > getting better.
> >
> > I have a few questions and suggestions
>
> Thanks for your comments.
>
> > 1. Prologue of GetLockMethodLocalHash() mentions
> > * NOTE: When there are many entries in LockMethodLocalHash, calling
> > this
> > * function and looking into all of them can lead to performance
> > problems.
> > */
> > How bad could this performance be? Let's assume that a query is
> > taking a long time and pg_log_query_plan() is invoked to examine its
> > plan. Is it possible that looping over all the locks itself takes a
> > lot of time, delaying the query execution further?
>
> I think it depends on the number of local locks, but just for your
> information, I've measured the cpu time for this page lock check by
> adding the code below together with
> v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch[1],
> which calls ProcessLogQueryPlanInterrupt() in every CFI, on my laptop:
>
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index 5f7d77d567..65b7cb4925 100644
> --- a/src/backend/commands/explain.c
> +++ b/src/backend/commands/explain.c
> @@ -44,6 +44,8 @@
>
> +#include "time.h"
> ...
> @@ -5287,6 +5292,7 @@ ProcessLogQueryPlanInterrupt(void)
>           * we check all the LocalLock entries and when finding even one, give up
>           * logging the plan.
>           */
> +        start = clock();
>          hash_seq_init(&status, GetLockMethodLocalHash());
>          while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
>          {
>                  if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
>                  {
>                          ereport(LOG_SERVER_ONLY,
>                                  errmsg("ignored request for logging query plan due to page lock conflicts"),
>                                  errdetail("You can try again in a moment."));
>                          hash_seq_term(&status);
>
>                          ProcessLogQueryPlanInterruptActive = false;
>                          return;
>                  }
>          }
> +        end = clock();
> +        cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
> +
> +        ereport(LOG,
> +                errmsg("all locallock entry search took: %f", cpu_time_used));
> +
> There were about 3 million log lines that recorded the cpu time, and
> the durations were quite short:
>
> =# -- Extracted cpu_time_used from log and loaded it to cpu_time.d.
> =# select max(d), min(d), avg(d) from cpu_time ;
>     max    | min |          avg
>  ----------+-----+-----------------------
>   0.000116 |   0 | 4.706274625332238e-07
>
> I'm not certain that this is valid for actual use cases, but these
> results seem to suggest that it will not take that long.

What load did you run? I don't think any query in make check would
take, say, thousands of locks. The prologue refers to a very populated
lock hash table. I think that will happen if thousands of tables are
queried in a single query, or a query runs on a partitioned table with
thousands of partitions. Maybe we want to try that scenario.
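The partitioned-table scenario could be set up with generated SQL along these lines; a sketch only, with made-up table names, that creates thousands of partitions so a query on the parent populates LockMethodLocalHash with thousands of entries:

```python
# Generate SQL for a table with many partitions; querying the parent
# without pruning quals then takes a local lock per partition.
def many_partitions_sql(n_parts: int) -> str:
    stmts = ["CREATE TABLE t (id int, v int) PARTITION BY RANGE (id);"]
    for i in range(n_parts):
        stmts.append(
            f"CREATE TABLE t_p{i} PARTITION OF t"
            f" FOR VALUES FROM ({i * 1000}) TO ({(i + 1) * 1000});"
        )
    # A query with no pruning quals touches every partition.
    stmts.append("SELECT count(*) FROM t;")
    return "\n".join(stmts)

# e.g. feed this to psql while pg_log_query_plan() is fired from
# another backend.
print(many_partitions_sql(5000).splitlines()[0])
```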

>
>
> > 2. What happens if auto_explain is enabled in the backend and
> > pg_log_query_plan() is called on the same backend? Will they conflict?
> > I think we should add a test for the same.
>
> Hmm, I think they don't conflict, since they just refer to the
> QueryDesc without modifying it, and they don't use the same objects
> for locking.
> (I imagine 'conflict' here is something like a 'hard conflict' in
> replication[2].)

By conflict, I mean the two features behaving weirdly when used
together, e.g. giving wrong results or crashing.

>
> Actually, using both auto_explain and pg_log_query_plan() outputs
> each log separately:
>
> (pid:62835)=# select pg_sleep(10);
> (pid:70000)=# select pg_log_query_plan(62835);
>
> (pid:70000)=# \! cat data/log/postgres.log
> ...
> 2024-02-06 21:44:17.837 JST [62835:4:0] LOG:  00000: query plan running on backend with PID 62835 is:
>         Query Text: select pg_sleep(10);
>         Result  (cost=0.00..0.01 rows=1 width=4)
>           Output: pg_sleep('10'::double precision)
>         Query Identifier: 3506829283127886044
> 2024-02-06 21:44:17.837 JST [62835:5:0] LOCATION:  ProcessLogQueryPlanInterrupt, explain.c:5336
> 2024-02-06 21:44:26.974 JST [62835:6:0] LOG:  00000: duration: 10000.868 ms  plan:
>         Query Text: select pg_sleep(10);
>         Result  (cost=0.00..0.01 rows=1 width=4) (actual time=10000.802..10000.804 rows=1 loops=1)
>
> > Using injection point support, we should be able to add tests for
> > pg_log_query_plan() behaviour when there are page locks held, or
> > when auto_explain (with instrumentation) and pg_log_query_plan()
> > work on the same query plan. Use an injection point to make the
> > backend running the query wait at a suitable point to delay its
> > execution, and fire pg_log_query_plan() from another backend. Maybe
> > the same test could examine the server log file to see if the plan
> > is indeed output there.
> >
> > Given that the feature will be used when things have already gone
> > wrong, it should not make them worse. So more testing, especially
> > automated, would help.
>
> Thanks for the advice; it seems like a good idea.
> I'm going to try adding tests using injection points.

Your test with pg_sleep() is a good basic test. But more involved
testing might need something like injection points.

--
Best Wishes,
Ashutosh Bapat
