Re: RFC: Logging plan of the running query

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Étienne BERSAC <etienne(dot)bersac(at)dalibo(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, jtc331(at)gmail(dot)com, rafaelthca(at)gmail(dot)com
Subject: Re: RFC: Logging plan of the running query
Date: 2024-02-07 10:14:40
Message-ID: a82b1a3f677d2a3aa95083c394e12843@oss.nttdata.com
Lists: pgsql-hackers

On 2024-02-07 13:58, Ashutosh Bapat wrote:
> On Wed, Feb 7, 2024 at 9:38 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
> wrote:
>>
>> Hi Ashutosh,
>>
>> On 2024-02-06 19:51, Ashutosh Bapat wrote:
>>
>> > Thanks for the summary. It is helpful. I think patch is also getting
>> > better.
>> >
>> > I have a few questions and suggestions
>>
>> Thanks for your comments.
>>
>> > 1. Prologue of GetLockMethodLocalHash() mentions
>> > * NOTE: When there are many entries in LockMethodLocalHash, calling
>> > this
>> > * function and looking into all of them can lead to performance
>> > problems.
>> > */
>> > How bad could this performance be? Let's assume that a query is
>> > taking time and pg_log_query_plan() is invoked to examine the plan of
>> > this query. Is it possible that looping over all the locks itself
>> > takes a lot of time, delaying the query execution further?
>>
>> I think it depends on the number of local locks, but for your
>> information, I've measured the CPU time of this page lock check on my
>> laptop by adding the code below together with
>> v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch[1],
>> which calls ProcessLogQueryPlanInterrupt() in every CFI:
>>
>> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
>> index 5f7d77d567..65b7cb4925 100644
>> --- a/src/backend/commands/explain.c
>> +++ b/src/backend/commands/explain.c
>> @@ -44,6 +44,8 @@
>>
>> +#include "time.h"
>> ...
>> @@ -5287,6 +5292,7 @@ ProcessLogQueryPlanInterrupt(void)
>>           * we check all the LocalLock entries and when finding even one,
>>           * give up logging the plan.
>>           */
>> +        start = clock();
>>          hash_seq_init(&status, GetLockMethodLocalHash());
>>          while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
>>          {
>>              if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
>>              {
>>                  ereport(LOG_SERVER_ONLY,
>>                          errmsg("ignored request for logging query plan due to page lock conflicts"),
>>                          errdetail("You can try again in a moment."));
>>                  hash_seq_term(&status);
>>
>>                  ProcessLogQueryPlanInterruptActive = false;
>>                  return;
>>              }
>>          }
>> +        end = clock();
>> +        cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
>> +
>> +        ereport(LOG,
>> +                errmsg("all locallock entry search took: %f", cpu_time_used));
>> +
>>
>> There were about 3 million log lines that recorded the CPU time, and
>> the durations were quite short:
>>
>> =# -- Extracted cpu_time_used from the log and loaded it into cpu_time.d.
>> =# select max(d), min(d), avg(d) from cpu_time ;
>>     max    | min |          avg
>>  ----------+-----+-----------------------
>>   0.000116 |   0 | 4.706274625332238e-07
>>
>> I'm not certain that this is valid for actual use cases, but these
>> results seem to suggest that it will not take that long.
>
> What load did you run? I don't think any query in make check would
> take say thousands of locks.

Sorry, I forgot to mention it, but I ran make check, as you guessed.

> The prologue refers to a very populated
> lock hash table. I think that will happen if thousands of tables are
> queried in a single query, or a query runs on a partitioned table with
> thousands of partitions. Maybe we want to try that scenario.

OK, I'll try such cases.
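As a rough out-of-server illustration of that scenario, the sketch below (plain C; LocalLockEntry, the LOCKTAG_PAGE value, and scan_local_locks() are simplified stand-ins, not PostgreSQL's real LOCALLOCK or dynahash machinery) times a full scan over thousands of entries with clock(), the same primitive used in the measurement patch above:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a local lock entry; the real LOCALLOCK in
 * src/include/storage/lock.h is a richer hash table entry. */
typedef struct LocalLockEntry
{
    int locktag_type;           /* simplified lock tag type */
} LocalLockEntry;

#define LOCKTAG_PAGE 6          /* placeholder value for illustration */

/* Scan all entries, stopping at the first page lock (mirroring the
 * early bail-out in ProcessLogQueryPlanInterrupt), and return the CPU
 * time the scan took in seconds. *found reports whether a page lock
 * was seen. */
double
scan_local_locks(const LocalLockEntry *entries, size_t n, int *found)
{
    clock_t start = clock();

    *found = 0;
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].locktag_type == LOCKTAG_PAGE)
        {
            *found = 1;
            break;
        }
    }

    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Even with tens of thousands of entries, a linear scan over such small structs should finish in microseconds, which is consistent with the averages above; the real hash_seq_search() walk pays more per entry, so the thousands-of-partitions test is still worth running.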

>> > 2. What happens if auto_explain is enabled in the backend and
>> > pg_log_query_plan() is called on the same backend? Will they conflict?
>> > I think we should add a test for the same.
>>
>> Hmm, I don't think they conflict, since both just refer to the
>> QueryDesc without modifying it, and they don't use the same objects
>> for locking.
>> (I imagined 'conflict' here meant something like a 'hard conflict' in
>> replication[2].)
>
> By conflict, I mean the two features behave weird when used together
> e.g give wrong results or crash etc.
>
>>
>> In fact, using auto_explain and pg_log_query_plan() together outputs
>> each log separately:
>>
>> (pid:62835)=# select pg_sleep(10);
>> (pid:70000)=# select pg_log_query_plan(62835);
>>
>> (pid:70000)=# \! cat data/log/postgres.log
>> ...
>> 2024-02-06 21:44:17.837 JST [62835:4:0] LOG:  00000: query plan running on backend with PID 62835 is:
>>         Query Text: select pg_sleep(10);
>>         Result  (cost=0.00..0.01 rows=1 width=4)
>>           Output: pg_sleep('10'::double precision)
>>         Query Identifier: 3506829283127886044
>> 2024-02-06 21:44:17.837 JST [62835:5:0] LOCATION:  ProcessLogQueryPlanInterrupt, explain.c:5336
>> 2024-02-06 21:44:26.974 JST [62835:6:0] LOG:  00000: duration: 10000.868 ms  plan:
>>         Query Text: select pg_sleep(10);
>>         Result  (cost=0.00..0.01 rows=1 width=4) (actual time=10000.802..10000.804 rows=1 loops=1)
>>
>> > Using injection point support we should be able to add tests for
>> > testing pg_log_query_plan behaviour when there are page locks held or
>> > when auto_explain (with instrumentation) and pg_log_query_plan() work
>> > on the same query plan. Use injection point to make the backend
>> > running query wait at a suitable point to delay its execution and fire
>> > pg_log_query_plan() from other backend. May be the same test could
>> > examine the server log file to see if the plan is indeed output to the
>> > server log file.
>> >
>> > Given that the feature will be used when the things have already gone
>> > wrong, it should not make things more serious. So more testing and
>> > especially automated would help.
>>
>> Thanks for the advice; it seems like a good idea.
>> I'm going to try adding tests using injection points.
>
> Your test with pg_sleep() is a good basic test. But more involved
> testing might need something like injection points.

That may be so; I will consider whether there are any subtle timing
issues, etc.

--
Regards,

Atsushi Torikoshi
NTT DATA Group Corporation
