Re: RFC: Logging plan of the running query

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, david(dot)christensen(at)crunchydata(dot)com, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: RFC: Logging plan of the running query
Date: 2023-10-06 12:58:46
Message-ID: 7ab57ab0ba3416e2dbf54927752bc310@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2023-10-04 03:00, James Coleman wrote:
> On Thu, Sep 7, 2023 at 2:09 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
> wrote:
>>
>> On 2023-09-06 11:17, James Coleman wrote:
>>
>> >> > I've never been able to reproduce it (haven't tested the new version,
>> >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
>> >> > (first buster and now bullseye).
>> >> >
>> >> > I'm attaching several stacktraces in the hope that they provide some
>> >> > clues. These all match the ps output I sent earlier, though note in
>> >> > that output there is both the regress instance and my test instance
>> >> > (pid 3213249) running (different ports, of course, and they are from
>> >> > the exact same compilation run). I've attached ps output for the
>> >> > postgres processes under the make check process to simplify cross
>> >> > referencing.
>> >> >
>> >> > A few interesting things:
>> >> > - There's definitely a lock on a relation that seems to be what's
>> >> > blocking the processes.
>> >> > - When I try to connect with psql the process forks but then hangs
>> >> > (see the ps output with task names stuck in "authentication"). I've
>> >> > also included a trace from one of these.
>> >>
>> >> Thanks for sharing them!
>> >>
>> >> Many processes are waiting to acquire the LW lock, including the
>> >> process
>> >> trying to output the plan(select1.trace).
>> >>
>> >> I suspect that this is due to a lock that was acquired prior to being
>> >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able
>> >> to
>> >> reproduce the same situation..
>> >>
>> >
>> > I don't have time immediately to dig into this, but perhaps loading up
>> > the core dumps would allow us to see what query is running in each
>> > backend process (if it hasn't already been discarded by that point)
>> > and thereby determine what point in each test process led to the error
>> > condition?
>>
>> Thanks for the suggestion.
>> I am concerned that core dumps may not be readable on different
>> operating systems or other environments. (Unfortunately, I do not have
>> Debian on hand)
>>
>> It seems that we can know what queries were running from the stack
>> traces you shared.
>> As described above, I suspect a lock which was acquired prior to
>> ProcessLogQueryPlanInterrupt() caused the issue.
>> I will try a little more to see if I can devise a way to create the
>> same
>> situation.
>>
>> > Alternatively we might be able to apply the same trick to the test
>> > client instead...
>> >
>> > BTW, for my own easy reference in this thread: relid 1259 is pg_class
>> > if I'm not mistaken.
>>
>> Yeah, and I think it's strange that the lock to 1259 appears twice and
>> must be avoided.
>>
>> #10 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1)
>> at
>> lmgr.c:117
>> ..
>> #49 0x0000559d61b4989d in ProcessLogQueryPlanInterrupt () at
>> explain.c:5158
>> ..
>> #53 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1)
>> at
>> lmgr.c:117
>
> I chatted with Andres and David about this at PGConf.NYC,
Thanks again for the discussion with hackers!

> and I think
> what we need to do is explicitly disallow running this code any time
> we are inside of lock acquisition code.

Yeah, I think it's a straightforward workaround.
And I'm also concerned that the condition of being in the process
of acquiring some kind of lock is too strict and will make it almost
impossible to output a running plan.

Anyway I'm going to implement it and run pg_log_query_plan()
while the regression test is running to see how successful the plan
output is.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2023-10-06 13:00:13 RE: [PoC] pg_upgrade: allow to upgrade publisher nodeHayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Previous Message Ashutosh Bapat 2023-10-06 12:57:36 Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression