From: James Coleman <jtc331(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, david(dot)christensen(at)crunchydata(dot)com, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: RFC: Logging plan of the running query
Date: 2023-08-26 12:03:26
Message-ID: CAAaqYe9Mse7JC78DgT97sXoB6U+E5OL3zJAYRSsgy-TPP9PXmw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 25, 2023 at 7:43 AM James Coleman <jtc331(at)gmail(dot)com> wrote:
>
> On Thu, Aug 17, 2023 at 10:02 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> >
> > On 2023-06-16 01:34, James Coleman wrote:
> > > Attached is v28
> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> > > when necessary. Once built with those two patches I'm simply running
> > > `make check`.
> >
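(Side note for anyone following along: my understanding of that guard is
roughly the sketch below. This is illustrative only; the function body and
the exact errfinish() hook are my assumptions, not the literal v28 code.)

/*
 * Illustrative sketch, not the literal v28 code: a static flag
 * prevents re-entering the plan-logging path, and errfinish() resets
 * it so an ERROR raised mid-logging can't leave it permanently set.
 */
static bool ProcessLogQueryPlanInterruptActive = false;

void
ProcessLogQueryPlanInterrupt(void)
{
    if (ProcessLogQueryPlanInterruptActive)
        return;     /* already logging a plan; don't recurse */
    ProcessLogQueryPlanInterruptActive = true;

    /* ... build and log EXPLAIN output for the running query ... */

    ProcessLogQueryPlanInterruptActive = false;
}

/* ...and in errfinish(), on the error path: */
ProcessLogQueryPlanInterruptActive = false;
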
> > With the v28-0001 and v28-0002 patches applied, I confirmed that backend
> > processes consume a huge amount of memory, and in some environments they
> > were terminated by the OOM killer.
> >
> > This was because memory was allocated from existing memory contexts and
> > was not freed after ProcessLogQueryPlanInterrupt().
> > I've updated the patch to use a dedicated memory context for
> > ProcessLogQueryPlanInterrupt().
> >
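(For reference, the usual way to make that airtight is a short-lived
dedicated context. A minimal sketch, assuming v29 does something along
these lines; the context name here is my invention:)

/*
 * Minimal sketch of the dedicated-context pattern, assumed rather than
 * the literal v29 code. Everything the plan logging allocates goes into
 * a throwaway context that is deleted afterwards, so repeated
 * interrupts can't accumulate allocations in long-lived contexts.
 */
MemoryContext logcxt = AllocSetContextCreate(CurrentMemoryContext,
                                             "LogQueryPlanContext",
                                             ALLOCSET_DEFAULT_SIZES);
MemoryContext oldcxt = MemoryContextSwitchTo(logcxt);

/* ... ExplainPrintPlan() and friends allocate only in logcxt ... */

MemoryContextSwitchTo(oldcxt);
MemoryContextDelete(logcxt);
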
> > With the attached patch and the v28-0002 patch applied, `make check`
> > completed successfully after 20 minutes and 50GB of logs in my
> > environment.
> >
> > >>> On 2023-06-15 01:48, James Coleman wrote:
> > >>> > The tests have been running since last night, but have apparently
> > >>> > been hung now for many hours.
> >
> > I don't know if this has anything to do with the hang you faced, but I
> > thought it might be possible that the large amount of memory usage
> > resulted in swapping, which caused a significant delay in processing.
>
> Ah, yes, I think that could be a possible explanation. I was delaying
> on this thread because I wasn't comfortable with having caused an
> issue once (even if I couldn't easily reproduce it) without at least
> some theory as to the cause (and a fix).
>
> > If possible, I would be very grateful if you could try to reproduce
> > this with the v29 patch.
>
> I'll kick off some testing.
>

I haven't had time yet to investigate what's happening here, but 24 hours
later the first "make check" is still running, and at first glance it
seems to have the same behavior I'd seen that first time. The test
output is to this point:

# parallel group (5 tests): index_including create_view
index_including_gist create_index create_index_spgist
ok 66 + create_index 26365 ms
ok 67 + create_index_spgist 27675 ms
ok 68 + create_view 1235 ms
ok 69 + index_including 1102 ms
ok 70 + index_including_gist 1633 ms
# parallel group (16 tests): create_aggregate create_cast errors
roleattributes drop_if_exists hash_func typed_table create_am
infinite_recurse

and it hasn't progressed past that point since at least ~16 hours ago
(I wasn't monitoring it for the first several hours of the run).

I haven't connected up gdb yet, and won't be able to until maybe
tomorrow, but here's the ps output for postgres processes that are
running:

admin 3213249 0.0 0.0 196824 20552 ? Ss Aug25 0:00
/home/admin/postgresql-test/bin/postgres -D
/home/admin/postgresql-test-data
admin 3213250 0.0 0.0 196964 7284 ? Ss Aug25 0:00
postgres: checkpointer
admin 3213251 0.0 0.0 196956 4276 ? Ss Aug25 0:00
postgres: background writer
admin 3213253 0.0 0.0 196956 8600 ? Ss Aug25 0:00
postgres: walwriter
admin 3213254 0.0 0.0 198424 7316 ? Ss Aug25 0:00
postgres: autovacuum launcher
admin 3213255 0.0 0.0 198412 5488 ? Ss Aug25 0:00
postgres: logical replication launcher
admin 3237967 0.0 0.0 2484 516 pts/4 S+ Aug25 0:00
/bin/sh -c echo "# +++ regress check in src/test/regress +++" &&
PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH"
LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib"
INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template
../../../src/test/regress/pg_regress --temp-instance=./tmp_check
--inputdir=. --bindir= --dlpath=. --max-concurrent-tests=20
--schedule=./parallel_schedule
admin 3237973 0.0 0.0 197880 20688 pts/4 S+ Aug25 0:00
postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c
listen_addresses= -k /tmp/pg_regress-7mmGUa
admin 3237976 0.0 0.1 198332 44608 ? Ss Aug25 0:00
postgres: checkpointer
admin 3237977 0.0 0.0 198032 4640 ? Ss Aug25 0:00
postgres: background writer
admin 3237979 0.0 0.0 197880 8580 ? Ss Aug25 0:00
postgres: walwriter
admin 3237980 0.0 0.0 199484 7608 ? Ss Aug25 0:00
postgres: autovacuum launcher
admin 3237981 0.0 0.0 199460 5488 ? Ss Aug25 0:00
postgres: logical replication launcher
admin 3243644 0.0 0.2 252400 74656 ? Ss Aug25 0:01
postgres: admin regression [local] SELECT waiting
admin 3243645 0.0 0.1 205480 33992 ? Ss Aug25 0:00
postgres: admin regression [local] SELECT waiting
admin 3243654 99.9 0.0 203156 31504 ? Rs Aug25 1437:49
postgres: admin regression [local] VACUUM
admin 3243655 0.0 0.1 212036 38504 ? Ss Aug25 0:00
postgres: admin regression [local] SELECT waiting
admin 3243656 0.0 0.0 206024 30892 ? Ss Aug25 0:00
postgres: admin regression [local] DELETE waiting
admin 3243657 0.0 0.1 205568 32232 ? Ss Aug25 0:00
postgres: admin regression [local] ALTER TABLE waiting
admin 3243658 0.0 0.0 203740 21532 ? Ss Aug25 0:00
postgres: admin regression [local] ANALYZE waiting
admin 3243798 0.0 0.0 199884 8464 ? Ss Aug25 0:00
postgres: autovacuum worker
admin 3244733 0.0 0.0 199492 5956 ? Ss Aug25 0:00
postgres: autovacuum worker
admin 3245652 0.0 0.0 199884 8468 ? Ss Aug25 0:00
postgres: autovacuum worker

As you can see, there are a bunch of backends presumably waiting, and
the VACUUM process has been pegging a single CPU core at least since
that ~16-hours-ago mark.
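
(Pure speculation until I can get a backtrace, but one failure mode that
would match a backend spinning like that is the interrupt effectively
re-arming itself. Conceptually, with a hypothetical pending flag invented
purely for illustration:)

/*
 * Hypothetical illustration only; the flag name and placement are made
 * up. If the handler bails out early (say, the "active" guard is stuck
 * true) while something keeps re-setting the pending flag, every
 * CHECK_FOR_INTERRUPTS() cycles through here without making progress.
 */
if (LogQueryPlanPending)
{
    LogQueryPlanPending = false;
    ProcessLogQueryPlanInterrupt(); /* returns immediately if guard stuck */
}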

I hope to be able to do more investigation later, but I wanted to at
least give you this information now.

James Coleman
