Re: RFC: Logging plan of the running query

From: James Coleman <jtc331(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, david(dot)christensen(at)crunchydata(dot)com, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: RFC: Logging plan of the running query
Date: 2023-08-28 13:47:19
Message-ID: CAAaqYe_5p_MkBCgSYX=S7G3ZKi=2=_6bZQFzTgVYyv=1NRikpw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 28, 2023 at 3:01 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
>
> On 2023-08-26 21:03, James Coleman wrote:
> > On Fri, Aug 25, 2023 at 7:43 AM James Coleman <jtc331(at)gmail(dot)com> wrote:
> >>
> >> On Thu, Aug 17, 2023 at 10:02 AM torikoshia
> >> <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> >> >
> >> > On 2023-06-16 01:34, James Coleman wrote:
> >> > > Attached is v28
> >> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> >> > > when necessary. Once built with those two patches I'm simply running
> >> > > `make check`.
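
(For anyone following along: the flag is there so the plan-logging interrupt
can't re-enter itself, and the errfinish() change clears it again if an ERROR
escapes while the plan is being logged. Roughly, the pattern looks like the
sketch below; this is illustrative only, not the actual patch code.)

```
/*
 * Illustrative sketch only (not the patch itself): a guard flag that
 * keeps the plan-logging interrupt from re-entering itself.  The
 * corresponding errfinish() change resets the flag when an ERROR
 * escapes mid-logging, so one failed attempt doesn't block all future
 * plan-logging requests.
 */
static bool ProcessLogQueryPlanInterruptActive = false;

static void
ProcessLogQueryPlanInterrupt(void)
{
	if (ProcessLogQueryPlanInterruptActive)
		return;					/* already logging a plan; don't recurse */

	ProcessLogQueryPlanInterruptActive = true;
	/* ... EXPLAIN the currently running query and ereport(LOG, ...) it ... */
	ProcessLogQueryPlanInterruptActive = false;
}
```
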
> >> >
> >> > With the v28-0001 and v28-0002 patches, I confirmed that backend processes
> >> > consume a huge amount of memory, and in some environments they were
> >> > terminated by the OOM killer.
> >> >
> >> > This was because memory was allocated from existing memory contexts and was
> >> > not freed after ProcessLogQueryPlanInterrupt().
> >> > I updated the patch to use a dedicated memory context for
> >> > ProcessLogQueryPlanInterrupt().
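
(For readers following along: the standard shape of this kind of fix is to
create a short-lived child memory context, switch to it for the duration of
the work, and delete it afterwards. A rough sketch of that pattern using the
ordinary memory-context API, illustrative only and not the actual v29 code:)

```
#include "postgres.h"
#include "utils/memutils.h"

/*
 * Illustrative sketch only: do all of the plan-printing allocations in a
 * dedicated context so they can be released in one shot, instead of
 * leaking into whatever context happened to be current when the
 * interrupt arrived.
 */
static void
LogRunningQueryPlanSketch(void)		/* hypothetical name, not the patch's */
{
	MemoryContext logging_cxt;
	MemoryContext old_cxt;

	logging_cxt = AllocSetContextCreate(CurrentMemoryContext,
										"log running query plan",
										ALLOCSET_DEFAULT_SIZES);
	old_cxt = MemoryContextSwitchTo(logging_cxt);

	/* ... run the EXPLAIN machinery and ereport(LOG, ...) the result ... */

	MemoryContextSwitchTo(old_cxt);
	MemoryContextDelete(logging_cxt);	/* frees everything allocated above */
}
```
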
> >> >
> >> > Applying attached patch and v28-0002 patch, `make check` successfully
> >> > completed after 20min and 50GB of logs on my environment.
> >> >
> >> > >>> On 2023-06-15 01:48, James Coleman wrote:
> >> > >>> > The tests have been running since last night, but have apparently been
> >> > >>> > hung now for many hours.
> >> >
> >> > I don't know if this has anything to do with the hang you faced, but I
> >> > thought it might be possible that the large amount of memory usage resulted
> >> > in swapping, which caused a significant delay in processing.
> >>
> >> Ah, yes, I think that could be a possible explanation. I was holding off
> >> on this thread because I wasn't comfortable with having caused an issue
> >> once (even if I couldn't easily reproduce it) without at least some
> >> theory as to the cause (and a fix).
> >>
> >> > If possible, I would be very grateful if you could try to reproduce this
> >> > with the v29 patch.
> >>
> >> I'll kick off some testing.
> >>
> >
> > I don't have time to investigate what's happening here, but 24 hours
> > later the first "make check" is still running, and at first glance it
> > seems to have the same behavior I'd seen that first time. The test
> > output is to this point:
> >
> > # parallel group (5 tests): index_including create_view
> > index_including_gist create_index create_index_spgist
> > ok 66 + create_index 26365 ms
> > ok 67 + create_index_spgist 27675 ms
> > ok 68 + create_view 1235 ms
> > ok 69 + index_including 1102 ms
> > ok 70 + index_including_gist 1633 ms
> > # parallel group (16 tests): create_aggregate create_cast errors
> > roleattributes drop_if_exists hash_func typed_table create_am
> > infinite_recurse
> >
> > and it hasn't progressed past that point since at least ~16 hours ago
> > (I wasn't monitoring it for the first several hours of the run).
> >
> > I haven't connected up gdb yet, and won't be able to until maybe
> > tomorrow, but here's the ps output for postgres processes that are
> > running:
> >
> > admin 3213249 0.0 0.0 196824 20552 ? Ss Aug25 0:00
> > /home/admin/postgresql-test/bin/postgres -D
> > /home/admin/postgresql-test-data
> > admin 3213250 0.0 0.0 196964 7284 ? Ss Aug25 0:00
> > postgres: checkpointer
> > admin 3213251 0.0 0.0 196956 4276 ? Ss Aug25 0:00
> > postgres: background writer
> > admin 3213253 0.0 0.0 196956 8600 ? Ss Aug25 0:00
> > postgres: walwriter
> > admin 3213254 0.0 0.0 198424 7316 ? Ss Aug25 0:00
> > postgres: autovacuum launcher
> > admin 3213255 0.0 0.0 198412 5488 ? Ss Aug25 0:00
> > postgres: logical replication launcher
> > admin 3237967 0.0 0.0 2484 516 pts/4 S+ Aug25 0:00
> > /bin/sh -c echo "# +++ regress check in src/test/regress +++" &&
> > PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH"
> > LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib"
> > INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template
> > ../../../src/test/regress/pg_regress --temp-instance=./tmp_check
> > --inputdir=. --bindir= --dlpath=. --max-concurrent-tests=20
> > --schedule=./parallel_schedule
> > admin 3237973 0.0 0.0 197880 20688 pts/4 S+ Aug25 0:00
> > postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c
> > listen_addresses= -k /tmp/pg_regress-7mmGUa
> > admin 3237976 0.0 0.1 198332 44608 ? Ss Aug25 0:00
> > postgres: checkpointer
> > admin 3237977 0.0 0.0 198032 4640 ? Ss Aug25 0:00
> > postgres: background writer
> > admin 3237979 0.0 0.0 197880 8580 ? Ss Aug25 0:00
> > postgres: walwriter
> > admin 3237980 0.0 0.0 199484 7608 ? Ss Aug25 0:00
> > postgres: autovacuum launcher
> > admin 3237981 0.0 0.0 199460 5488 ? Ss Aug25 0:00
> > postgres: logical replication launcher
> > admin 3243644 0.0 0.2 252400 74656 ? Ss Aug25 0:01
> > postgres: admin regression [local] SELECT waiting
> > admin 3243645 0.0 0.1 205480 33992 ? Ss Aug25 0:00
> > postgres: admin regression [local] SELECT waiting
> > admin 3243654 99.9 0.0 203156 31504 ? Rs Aug25 1437:49
> > postgres: admin regression [local] VACUUM
> > admin 3243655 0.0 0.1 212036 38504 ? Ss Aug25 0:00
> > postgres: admin regression [local] SELECT waiting
> > admin 3243656 0.0 0.0 206024 30892 ? Ss Aug25 0:00
> > postgres: admin regression [local] DELETE waiting
> > admin 3243657 0.0 0.1 205568 32232 ? Ss Aug25 0:00
> > postgres: admin regression [local] ALTER TABLE waiting
> > admin 3243658 0.0 0.0 203740 21532 ? Ss Aug25 0:00
> > postgres: admin regression [local] ANALYZE waiting
> > admin 3243798 0.0 0.0 199884 8464 ? Ss Aug25 0:00
> > postgres: autovacuum worker
> > admin 3244733 0.0 0.0 199492 5956 ? Ss Aug25 0:00
> > postgres: autovacuum worker
> > admin 3245652 0.0 0.0 199884 8468 ? Ss Aug25 0:00
> > postgres: autovacuum worker
> >
> > As you can see, there are a bunch of backends presumably waiting, and
> > the VACUUM process has been pegging a single CPU core since at least
> > that ~16-hours-ago mark.
> >
> > I hope to be able to do more investigation later, but I wanted to at
> > least give you this information now.
>
> Thanks a lot for testing the patch!
> I really appreciate your cooperation.
>
> Hmm, I also tested again on the current HEAD (165d581f146b09) on Ubuntu
> 22.04 and macOS, but unfortunately (fortunately?) they succeeded, as
> shown below:
>
> ```
> $ git apply v29-0001-Add-function-to-log-the-plan-of-the-query.patch
> $ git apply v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
> $ ./configure --enable-debug --enable-cassert
> $ make
> $ make check
>
> ...(snip)...
>
> # parallel group (5 tests): index_including index_including_gist
> create_view create_index create_index_spgist
> ok 66 + create_index 25033 ms
> ok 67 + create_index_spgist 26144 ms
> ok 68 + create_view 3061 ms
> ok 69 + index_including 976 ms
> ok 70 + index_including_gist 2998 ms
> # parallel group (16 tests): create_cast errors create_aggregate
> roleattributes drop_if_exists hash_func typed_table
> create_am select constraints updatable_views inherit triggers vacuum
> create_function_sql infinite_recurse
> ok 71 + create_aggregate 225 ms
> ok 72 + create_function_sql 18874 ms
> ok 73 + create_cast 168 ms
>
> ...(snip)...
>
> # All 215 tests passed.
> ```
>
> If you notice any difference, I would be grateful if you could let me
> know.

I've never been able to reproduce it on my M1 Mac (I haven't tested the
new version there, but at least v28 didn't reproduce); where I have
reproduced it is on Debian (first buster and now bullseye).

I'm attaching several stack traces in the hope that they provide some
clues. These all match the ps output I sent earlier, though note that in
that output both the regress instance and my test instance (pid 3213249)
are running (on different ports, of course, and both from the exact same
compilation run). I've attached ps output for the postgres processes
under the make check process to simplify cross-referencing.

A few interesting things:
- There's definitely a lock on a relation that seems to be what's
blocking the processes.
- When I try to connect with psql, the process forks but then hangs
(see the ps output with task names stuck in "authentication"). I've
also included a trace from one of these.

If you think a core file for any of these processes would be helpful
for debugging, I'd be happy to try to figure out a way to get that to
you.

Regards,
James Coleman

Attachment Content-Type Size
ps.txt text/plain 2.0 KB
delete.trace application/octet-stream 5.7 KB
altertable.trace application/octet-stream 6.3 KB
analyze.trace application/octet-stream 5.0 KB
postmaster.trace application/octet-stream 1.1 KB
new_psql.trace application/octet-stream 3.6 KB
vacuum.trace application/octet-stream 4.3 KB
select3.trace application/octet-stream 6.4 KB
select1.trace application/octet-stream 19.2 KB
select2.trace application/octet-stream 5.3 KB
