Re: RFC: Logging plan of the running query

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, david(dot)christensen(at)crunchydata(dot)com, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: RFC: Logging plan of the running query
Date: 2023-08-28 07:01:53
Message-ID: 4e1e4e5d022964953e84468fe4e511b0@oss.nttdata.com
Lists: pgsql-hackers

On 2023-08-26 21:03, James Coleman wrote:
> On Fri, Aug 25, 2023 at 7:43 AM James Coleman <jtc331(at)gmail(dot)com> wrote:
>>
>> On Thu, Aug 17, 2023 at 10:02 AM torikoshia
>> <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
>> >
>> > On 2023-06-16 01:34, James Coleman wrote:
>> > > Attached is v28
>> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
>> > > when necessary. Once built with those two patches I'm simply running
>> > > `make check`.
>> >
>> > With the v28-0001 and v28-0002 patches, I confirmed that backend
>> > processes consume a huge amount of memory, and under some environments
>> > they were terminated by the OOM killer.
>> >
>> > This was because memory was allocated from existing memory contexts
>> > and was not freed after ProcessLogQueryPlanInterrupt().
>> > I updated the patch to use a dedicated memory context for
>> > ProcessLogQueryPlanInterrupt().
>> >
>> > Applying the attached patch and the v28-0002 patch, `make check`
>> > completed successfully after 20 minutes and 50GB of logs on my
>> > environment.
>> >
>> > >>> On 2023-06-15 01:48, James Coleman wrote:
>> > >>> > The tests have been running since last night, but have
>> > >>> > apparently been hung now for many hours.
>> >
>> > I don't know if this has anything to do with the hang you faced, but I
>> > thought it might be possible that the large amount of memory usage
>> > resulted in swapping, which caused a significant delay in processing.
>>
>> Ah, yes, I think that could be a possible explanation. I was delaying
>> on this thread because I wasn't comfortable with having caused an
>> issue once (even if I couldn't easily reproduce it) without at least
>> some theory as to the cause (and a fix).
>>
>> > If possible, I would be very grateful if you could try to reproduce this
>> > with
>> > the v29 patch.
>>
>> I'll kick off some testing.
>>
>
> I don't have time to investigate what's happening here, but 24 hours
> later the first "make check" is still running, and at first glance it
> seems to have the same behavior I'd seen that first time. The test
> output is to this point:
>
> # parallel group (5 tests): index_including create_view
> index_including_gist create_index create_index_spgist
> ok 66 + create_index 26365 ms
> ok 67 + create_index_spgist 27675 ms
> ok 68 + create_view 1235 ms
> ok 69 + index_including 1102 ms
> ok 70 + index_including_gist 1633 ms
> # parallel group (16 tests): create_aggregate create_cast errors
> roleattributes drop_if_exists hash_func typed_table create_am
> infinite_recurse
>
> and it hasn't progressed past that point since at least ~16 hours ago
> (I wasn't monitoring it for the first several hours of the run).
>
> I haven't connected up gdb yet, and won't be able to until maybe
> tomorrow, but here's the ps output for postgres processes that are
> running:
>
> admin 3213249 0.0 0.0 196824 20552 ? Ss Aug25 0:00
> /home/admin/postgresql-test/bin/postgres -D
> /home/admin/postgresql-test-data
> admin 3213250 0.0 0.0 196964 7284 ? Ss Aug25 0:00
> postgres: checkpointer
> admin 3213251 0.0 0.0 196956 4276 ? Ss Aug25 0:00
> postgres: background writer
> admin 3213253 0.0 0.0 196956 8600 ? Ss Aug25 0:00
> postgres: walwriter
> admin 3213254 0.0 0.0 198424 7316 ? Ss Aug25 0:00
> postgres: autovacuum launcher
> admin 3213255 0.0 0.0 198412 5488 ? Ss Aug25 0:00
> postgres: logical replication launcher
> admin 3237967 0.0 0.0 2484 516 pts/4 S+ Aug25 0:00
> /bin/sh -c echo "# +++ regress check in src/test/regress +++" &&
> PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH"
> LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib"
> INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template
> ../../../src/test/regress/pg_regress --temp-instance=./tmp_check
> --inputdir=. --bindir= --dlpath=. --max-concurrent-tests=20
> --schedule=./parallel_schedule
> admin 3237973 0.0 0.0 197880 20688 pts/4 S+ Aug25 0:00
> postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c
> listen_addresses= -k /tmp/pg_regress-7mmGUa
> admin 3237976 0.0 0.1 198332 44608 ? Ss Aug25 0:00
> postgres: checkpointer
> admin 3237977 0.0 0.0 198032 4640 ? Ss Aug25 0:00
> postgres: background writer
> admin 3237979 0.0 0.0 197880 8580 ? Ss Aug25 0:00
> postgres: walwriter
> admin 3237980 0.0 0.0 199484 7608 ? Ss Aug25 0:00
> postgres: autovacuum launcher
> admin 3237981 0.0 0.0 199460 5488 ? Ss Aug25 0:00
> postgres: logical replication launcher
> admin 3243644 0.0 0.2 252400 74656 ? Ss Aug25 0:01
> postgres: admin regression [local] SELECT waiting
> admin 3243645 0.0 0.1 205480 33992 ? Ss Aug25 0:00
> postgres: admin regression [local] SELECT waiting
> admin 3243654 99.9 0.0 203156 31504 ? Rs Aug25 1437:49
> postgres: admin regression [local] VACUUM
> admin 3243655 0.0 0.1 212036 38504 ? Ss Aug25 0:00
> postgres: admin regression [local] SELECT waiting
> admin 3243656 0.0 0.0 206024 30892 ? Ss Aug25 0:00
> postgres: admin regression [local] DELETE waiting
> admin 3243657 0.0 0.1 205568 32232 ? Ss Aug25 0:00
> postgres: admin regression [local] ALTER TABLE waiting
> admin 3243658 0.0 0.0 203740 21532 ? Ss Aug25 0:00
> postgres: admin regression [local] ANALYZE waiting
> admin 3243798 0.0 0.0 199884 8464 ? Ss Aug25 0:00
> postgres: autovacuum worker
> admin 3244733 0.0 0.0 199492 5956 ? Ss Aug25 0:00
> postgres: autovacuum worker
> admin 3245652 0.0 0.0 199884 8468 ? Ss Aug25 0:00
> postgres: autovacuum worker
>
> As you can see there are a bunch of backends presumably waiting, and
> the VACUUM process has been pegging a single CPU core since at least
> that ~16-hours-ago mark.
>
> I hope to be able to do more investigation later, but I wanted to at
> least give you this information now.

Thanks a lot for testing the patch!
I really appreciate your cooperation.
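
For anyone following along, let me recap what v29 changed: all
allocations made while logging the plan now go into a dedicated memory
context that is reset afterwards, so they no longer pile up in
long-lived contexts. A minimal sketch of the idea (not the literal
patch hunk; the parent context and the function body here are
illustrative):

```
/*
 * Simplified sketch of the v29 approach; the memory context APIs are
 * declared in utils/memutils.h.
 */
static MemoryContext LogQueryPlanContext = NULL;

void
ProcessLogQueryPlanInterrupt(void)
{
    MemoryContext oldcxt;

    /* Create the dedicated context lazily, once per backend. */
    if (LogQueryPlanContext == NULL)
        LogQueryPlanContext = AllocSetContextCreate(TopMemoryContext,
                                                    "LogQueryPlanContext",
                                                    ALLOCSET_DEFAULT_SIZES);

    oldcxt = MemoryContextSwitchTo(LogQueryPlanContext);

    /* ... build the EXPLAIN text for the running query and log it ... */

    MemoryContextSwitchTo(oldcxt);

    /*
     * Free everything allocated while producing the plan text. In v28
     * these allocations went into existing long-lived contexts and were
     * never freed, which is what drove backends into the OOM killer.
     */
    MemoryContextReset(LogQueryPlanContext);
}
```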

Hmm, I also tested again on the current HEAD (165d581f146b09) on Ubuntu
22.04 and macOS, but unfortunately (fortunately?) the runs succeeded, as
below:

```
$ git apply v29-0001-Add-function-to-log-the-plan-of-the-query.patch
$ git apply v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
$ ./configure --enable-debug --enable-cassert
$ make
$ make check

...(snip)...

# parallel group (5 tests): index_including index_including_gist
create_view create_index create_index_spgist
ok 66 + create_index 25033 ms
ok 67 + create_index_spgist 26144 ms
ok 68 + create_view 3061 ms
ok 69 + index_including 976 ms
ok 70 + index_including_gist 2998 ms
# parallel group (16 tests): create_cast errors create_aggregate
roleattributes drop_if_exists hash_func typed_table
create_am select constraints updatable_views inherit triggers vacuum
create_function_sql infinite_recurse
ok 71 + create_aggregate 225 ms
ok 72 + create_function_sql 18874 ms
ok 73 + create_cast 168 ms

...(snip)...

# All 215 tests passed.
```
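
(A note on why `make check` is such a heavy stress test here: as I
understand the v28-0002 patch, it effectively makes every
CHECK_FOR_INTERRUPTS() attempt to log the running query's plan,
conceptually like the sketch below, which is why a full run produced
~50GB of logs on my machine. The actual hook in the patch may differ.)

```
/* Upstream (miscadmin.h) defines roughly: */
#define CHECK_FOR_INTERRUPTS() \
do { \
    if (INTERRUPTS_PENDING_CONDITION()) \
        ProcessInterrupts(); \
} while(0)

/*
 * For testing, 0002 conceptually turns each call into "also try to log
 * the current plan" (illustrative only):
 */
#define CHECK_FOR_INTERRUPTS_AND_LOG_PLAN() \
do { \
    CHECK_FOR_INTERRUPTS(); \
    ProcessLogQueryPlanInterrupt(); /* test-only: log plan on every CFI */ \
} while(0)
```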

If you notice any difference, I would be grateful if you could let me
know.
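
One more recap that may help while poking at the hang: the key change
in v28 was resetting ProcessLogQueryPlanInterruptActive in errfinish(),
so that an ERROR thrown while logging a plan cannot leave the guard
flag permanently set. Conceptually (again a simplified sketch, not the
patch code):

```
/* Set while ProcessLogQueryPlanInterrupt() is running, so that a nested
 * CHECK_FOR_INTERRUPTS() cannot recurse into plan logging. */
extern bool ProcessLogQueryPlanInterruptActive;

void
errfinish(const char *filename, int lineno, const char *funcname)
{
    /* ... existing ERROR cleanup ... */

    /*
     * If an ERROR escaped ProcessLogQueryPlanInterrupt() before it could
     * clear the flag itself, clear it here; otherwise plan logging would
     * remain disabled for the rest of the backend's lifetime.
     */
    ProcessLogQueryPlanInterruptActive = false;

    /* ... re-throw / longjmp back to the error handler ... */
}
```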

--
Regards,

Atsushi Torikoshi
NTT DATA Group Corporation
