Re: RFC: Logging plan of the running query

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, david(dot)christensen(at)crunchydata(dot)com, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: RFC: Logging plan of the running query
Date: 2023-06-27 10:03:21
Message-ID: 00dac323d6732d59ba2c03be729966dc@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2023-06-16 01:34, James Coleman wrote:
> On Thu, Jun 15, 2023 at 9:00 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
> wrote:
>>
>> On 2023-06-15 01:48, James Coleman wrote:
>> > On Tue, Jun 13, 2023 at 11:53 AM James Coleman <jtc331(at)gmail(dot)com>
>> > wrote:
>> >>
>> >> ...
>> >> I'm going to re-run tests with my patch version + resetting the flag
>> >> on SIGINT (and any other error condition) to be certain that the issue
>> >> you uncovered (where backends get stuck after a SIGINT not responding
>> >> to the requested plan logging) wasn't masking any other issues.
>> >>
>> >> As long as that run is clean also then I believe the patch is safe
>> >> as-is even without the re-entrancy guard.
>> >>
>> >> I'll report back with the results of that testing.
>> >
>> > The tests have been running since last night, but have been apparently
>> > hung now for many hours. I haven't been able to fully look into it,
>> > but so far I know the hung (100% CPU) backend last logged this:
>> >
>> > 2023-06-14 02:00:30.045 UTC client backend[84461]
>> > pg_regress/updatable_views LOG: query plan running on backend with
>> > PID 84461 is:
>> > Query Text: SELECT table_name, column_name, is_updatable
>> > FROM information_schema.columns
>> > WHERE table_name LIKE E'r_\\_view%'
>> > ORDER BY table_name, ordinal_position;
>> >
>> > The last output from the regression test harness was:
>> >
>> > # parallel group (5 tests): index_including create_view
>> > index_including_gist create_index create_index_spgist
>> > ok 66 + create_index 36508 ms
>> > ok 67 + create_index_spgist 38588 ms
>> > ok 68 + create_view 1394 ms
>> > ok 69 + index_including 654 ms
>> > ok 70 + index_including_gist 1701 ms
>> > # parallel group (16 tests): errors create_cast drop_if_exists
>> > create_aggregate roleattributes constraints hash_func typed_table
>> > infinite_recurse
>> >
>> > Attaching gdb to the hung backend shows this:
>> >
>> > #0 0x00005601ab1f9529 in ProcLockWakeup
>> > (lockMethodTable=lockMethodTable(at)entry=0x5601ab6484e0
>> > <default_lockmethod>, lock=lock(at)entry=0x7f5325c913f0) at proc.c:1655
>> > #1 0x00005601ab1e99dc in CleanUpLock (lock=lock(at)entry=0x7f5325c913f0,
>> > proclock=proclock(at)entry=0x7f5325d40d60,
>> > lockMethodTable=lockMethodTable(at)entry=0x5601ab6484e0
>> > <default_lockmethod>,
>> > hashcode=hashcode(at)entry=573498161, wakeupNeeded=<optimized out>)
>> > at lock.c:1673
>> > #2 0x00005601ab1e9e21 in LockRefindAndRelease
>> > (lockMethodTable=lockMethodTable(at)entry=0x5601ab6484e0
>> > <default_lockmethod>, proc=<optimized out>,
>> > locktag=locktag(at)entry=0x5601ac3d7998, lockmode=lockmode(at)entry=1,
>> >
>> > decrement_strong_lock_count=decrement_strong_lock_count(at)entry=false)
>> > at lock.c:3150
>> > #3 0x00005601ab1edb27 in LockReleaseAll
>> > (lockmethodid=lockmethodid(at)entry=1, allLocks=false) at lock.c:2295
>> > #4 0x00005601ab1f8599 in ProcReleaseLocks
>> > (isCommit=isCommit(at)entry=true) at proc.c:781
>> > #5 0x00005601ab37f1f4 in ResourceOwnerReleaseInternal
>> > (owner=<optimized out>, phase=phase(at)entry=RESOURCE_RELEASE_LOCKS,
>> > isCommit=isCommit(at)entry=true, isTopLevel=isTopLevel(at)entry=true) at
>> > resowner.c:618
>> > #6 0x00005601ab37f7b7 in ResourceOwnerRelease (owner=<optimized out>,
>> > phase=phase(at)entry=RESOURCE_RELEASE_LOCKS,
>> > isCommit=isCommit(at)entry=true, isTopLevel=isTopLevel(at)entry=true) at
>> > resowner.c:494
>> > #7 0x00005601aaec1d84 in CommitTransaction () at xact.c:2334
>> > #8 0x00005601aaec2b22 in CommitTransactionCommand () at xact.c:3067
>> > #9 0x00005601ab200a66 in finish_xact_command () at postgres.c:2783
>> > #10 0x00005601ab20338f in exec_simple_query (
>> > query_string=query_string(at)entry=0x5601ac3b0858 "SELECT table_name,
>> > column_name, is_updatable\n FROM information_schema.columns\n WHERE
>> > table_name LIKE E'r_\\\\_view%'\n ORDER BY table_name,
>> > ordinal_position;") at postgres.c:1300
>> >
>> > I am unable to connect to the regression test Postgres instance --
>> > psql just hangs, so the lock seems to have affected the postmaster
>> > also.
>> >
>> > I'm wondering if this might represent a bug in the current patch.
>>
>> Thanks for running and analyzing the test!
>
> Sure thing!
>
>> Could you share me how you are running the test?
>>
>> I imagined something like below, but currently couldn't reproduce it.
>> - apply both v26-0001 and v27-0002 and build
>> - run PostgreSQL with default GUCssaaa
>> - make installcheck-world
>> - run 'SELECT pg_log_query_plan(pid) FROM pg_stat_activity \watch 0.1'
>> during make installcheck-world
>
> Apologies, I should have attached my updated patch (with the fix for
> the bug you'd reporting with the re-entrancy guard). Attached is v28
> which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> when necessary. Once built with those two patches I'm simply running
> `make check`.

Thanks!

However, I haven't succeeded in reproducing the problem as below:

> The tests have been running since last night, but have been apparently
> hung now for many hours. I haven't been able to fully look into it,
> but so far I know the hung (100% CPU) backend last logged this:

Did you do something like this?

$ ./configure --prefix=/home/ubuntu/pgsql/master --enable-cassert
$ git apply
../patch/v28-0001-Add-function-to-log-the-plan-of-the-query.patch
../patch/v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
$ make
$ make check

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2023-06-27 10:15:48 Re: Support logical replication of DDLs
Previous Message David Rowley 2023-06-27 09:19:26 Add bump memory context type and use it for tuplesorts