120.memory_leak_extended_memqcache fails on master

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: pgpool-hackers(at)lists(dot)postgresql(dot)org
Subject: 120.memory_leak_extended_memqcache fails on master
Date: 2026-04-30 10:15:09
Message-ID: 20260430.191509.901412343921235299.ishii@postgresql.org
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgpool-hackers

Summary:

- Recent 120.memory_leak_extended_memqcache regression test failures
on master branch do not mean we have memory leaks with query cache
and/or extended query processing. It's false positive because the
test script is not prepared for child process going away by FATAL
error.

- The FATAL error is caused by an oversight in commit a350afd70
"Feature: send sync message only to necessary backends.".

- Proposed fix attached.

Recently 120.memory_leak_extended_memqcache regression test
occasionally fails on master. So I dug into the buildfarm log and
found following error. Note that in this case the load balance node is
1. If it is 0, the test succeeds.

The test runs pgbench and the log below emitted at the end of
bench marking. When pgbench finishes, pgpool executes queries
registered in reset_query_list. "DISCARD ALL" comes from it.

2026-04-29 01:00:58.287: pgbench pid 62605: LOG: DB node id: 1 backend pid: 62625 statement: Sync
2026-04-29 01:00:58.287: pgbench pid 62605: LOG: DB node id: 0 backend pid: 62624 statement: DISCARD ALL
2026-04-29 01:00:58.287: pgbench pid 62605: LOG: DB node id: 1 backend pid: 62625 statement: DISCARD ALL
2026-04-29 01:00:58.288: pgbench pid 62605: LOG: pool_send_and_wait: Error or notice message from backend: DB node id: 0 backend pid: 62624 statement: "DISCARD ALL" message: "DISCARD ALL cannot run inside a transaction block"
2026-04-29 01:00:58.288: pgbench pid 62605: WARNING: packet kind of backend 1 ['C'] does not match with main/majority nodes packet kind ['E']
2026-04-29 01:00:58.288: pgbench pid 62605: FATAL: failed to read kind from backend
2026-04-29 01:00:58.288: pgbench pid 62605: DETAIL: kind mismatch among backends. Possible last query was: "DISCARD ALL" kind details are:0[E: DISCARD ALL cannot run inside a transaction block] 1[C]
2026-04-29 01:00:58.288: pgbench pid 62605: HINT: check data consistency among db nodes

The FATAL error terminated child process 62605, caues the test
failure. From 129 test's test.sh:

# run pgbench for a while
$PGBENCH -M extended -S -T 30 test

after_size=`ps l $pid|tail -1|awk '{print $7}'`
delta=`expr $after_size - $init_size`

echo "initial process size: $init_size after size: $after_size delta: $delta"

test $delta -eq 0

if [ $? != 0 ];then
echo "memory leak in $delta KB in mode:$mode"
./shutdownall
exit 1
fi

and from src/test/regression/log/120.memory_leak_extended_memqcache:

expr: non-integer argument
initial process size: 156900 after size: VSZ delta:
./test.sh: line 59: test: -eq: unary operator expected
memory leak in KB in mode:s

Since ps command does not see the data of process 62605 any more, the
"test $delta -eq 0" command failed and it makes the script believe that
there's a memory leak. Probably we should fix test.sh here but I think
FATAL error should be fixed in the first place. Once we fix it,
there's no need to fix test.sh. So I think we can leave test.sh as it
is now.

The reason why DISCARD ALL failed on node 0 PostgreSQL is, DISCARD ALL
executes inside a transaction block according to the log. However the
queries executed by pgbench do not use an explicit transactions. I was
puzzled. So I look into PostgreSQL log on node 0. Grepping related
process (62624) from the PostgreSQL log shows this:

62624 2026-04-29 01:00:28.288 UTC LOG: execute pgpool62605/pgpool62605: SELECT count(*) FROM pg_catalog.pg_class AS c, pg_catalog.pg_namespace AS n WHERE c.relname = 'pgbench_accounts' AND c.relnamespace = n.oid AND n.nspname ~ '^pg_temp_'
62624 2026-04-29 01:00:29.068 UTC LOG: execute pgpool62605/pgpool62605: SELECT COALESCE(pg_catalog.to_regclass('"pgbench_accounts"')::oid, 0)
62624 2026-04-29 01:00:58.287 UTC LOG: statement: DISCARD ALL
62624 2026-04-29 01:00:58.287 UTC ERROR: DISCARD ALL cannot run inside a transaction block
62624 2026-04-29 01:00:58.287 UTC STATEMENT: DISCARD ALL

Those two execute commands were issued by do_query, which is
responsible for getting information for the relation cache
module. When pgpool runs under extended query mode and outside an
explicit transaction, do_query does not issue a sync message because
it could break unnamed portal. So in process 62624 on node 0 continues
to be in an extended query mode without sending sync. In this case
PostgreSQL keeps on opening an implicit transaction until sync message
is sent. Why pgpool did not send sync to node 0?

Commit a350afd70 allows to send sync messages only to necessary
backend nodes. Any queries comes from clients are tracked for this
purpose. However it forgot to the case when do_query generates
internal queries. I think we should track such that queries, so that a
sync message is issued to the server which do_query sent to, when the
client sends a sync message.

For this purpose I add a new member "bool
pending_sync_map[MAX_NUM_BACKENDS]" to session_context, which a node
id is set to when do_query sends a extended query to backend. Note
that it is set only when do_query is called outside an explicit
transaction. In add_sync_pending_message, which is called when client
sends a sync message, we check the map and set sync_map so that
subsequent call to SimpleForwardToBackend will send sync message to
the backend node.

Attached patch implements this.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Attachment Content-Type Size
v1-0001-Fix-to-send-sync-message-to-necessary-backend-nod.patch text/x-patch 6.6 KB

Responses

Browse pgpool-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2026-04-30 23:35:29 Re: 120.memory_leak_extended_memqcache fails on master
Previous Message Tatsuo Ishii 2026-04-26 05:56:19 Re: Problem with pcp process