Re: Proposal: Recent mutated table tracking in memory

From: Nadav Shatz <nadav(at)tailorbrands(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: pgpool-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Proposal: Recent mutated table tracking in memory
Date: 2026-05-20 12:25:54
Message-ID: CACeKOO2eUrfo_UDMFSEd=2y8zj8y93m38EzRCpg1HuizYBf3wA@mail.gmail.com
Lists: pgpool-hackers

Hi Tatsuo,

Thanks for checking v3, and sorry for missing the test issue.

I reproduced the timeout locally, then found and fixed the root cause.

Root cause
----------

In CommandComplete.c, the autocommit write-tracking code was
gated only on session_context->is_in_transaction, not on the
cluster mode.

In native replication and snapshot isolation modes,
dml_adaptive() is never called (it lives inside
where_to_send_main_replica), so is_in_transaction is never set
to true even inside an explicit BEGIN/COMMIT block. As a result,
every DML in those modes was treated as autocommit by the
write-tracking code, which called
pool_track_table_mutation_get_database_oid() (which does a
relcache do_query) while a transaction was actually in flight
on the backend connection. The do_query conflicts with the
in-flight transaction and hangs the session. The subsequent
shutdown then hangs in terminate_all_childrens / waitpid.

Fix
---

Gate the autocommit write-tracking in CommandComplete.c on
MAIN_REPLICA in addition to the existing checks.
dml_adaptive_global is only meaningful in streaming replication
mode anyway (the matching routing logic in
where_to_send_main_replica is already SR-only), so this just
makes the autocommit path consistent.

I also broadened the query cache bypass to all dml_adaptive*
modes. The new helper pool_has_dml_adaptive_write_in_transaction()
checks the existing memqcache DML oid buffer (oidbufp via the
new pool_has_dml_table_oids()), which is populated for any DML
in any cluster mode and reset at each transaction boundary. This
fixes the original "SELECT returns stale 1 instead of 2 after
UPDATE" regression in streaming replication and avoids the same
class of bug in plain dml_adaptive too.
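
The idea behind the bypass check can be sketched as below. Again
this is a simplified standalone model, not the patch itself: the
fixed-size buffer, the model_* names, and the dml_adaptive_family
parameter are all hypothetical, and the real oidbufp handling in
memqcache is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_OID_BUF_MAX 16

/* Hypothetical stand-in for the memqcache DML oid buffer. */
typedef struct {
    unsigned int oids[MODEL_OID_BUF_MAX];
    size_t count;
} model_dml_oid_buffer;

/* Record the oid of a table touched by DML (any cluster mode). */
static void model_add_dml_oid(model_dml_oid_buffer *buf, unsigned int oid)
{
    assert(buf != NULL);
    if (buf->count < MODEL_OID_BUF_MAX)
        buf->oids[buf->count++] = oid;
}

/* The buffer is emptied at each transaction boundary. */
static void model_reset_on_transaction_boundary(model_dml_oid_buffer *buf)
{
    assert(buf != NULL);
    buf->count = 0;
}

/* Models pool_has_dml_table_oids(): any DML seen since the last reset? */
static bool model_has_dml_table_oids(const model_dml_oid_buffer *buf)
{
    assert(buf != NULL);
    return buf->count > 0;
}

/* Models pool_has_dml_adaptive_write_in_transaction(): bypass the query
 * cache if a dml_adaptive* setting is active and a write has happened
 * in the current transaction. */
static bool model_should_bypass_query_cache(bool dml_adaptive_family,
                                            const model_dml_oid_buffer *buf)
{
    return dml_adaptive_family && model_has_dml_table_oids(buf);
}
```

So a SELECT issued after an UPDATE in the same transaction skips
the cache, and the bypass ends once the buffer is reset.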

Verified
--------

- 006.memqcache with disable_load_balance_on_write =
  'dml_adaptive_global' appended in all three modes: PASS
- 043.track_table_mutation: PASS

Attached: v4-0001-Feature-load-balancing-control-by-table-tracking.patch

Thanks!

On Wed, May 20, 2026 at 7:28 AM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:

> > Hi Nadav,
> >
> > Sorry, I missed your last email.
> > Will check & test tomorrow.
>
> I finally got a chance to test your v3 patch.
> Unfortunately the test failed with timeout again.
>
> testing 006.memqcache...timeout.
> out of 1 ok:0 failed:0 timeout:1
>
> From src/test/regression/log/006.memqcache:
>
> 2026-05-20 13:08:33.798: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
> .....2026-05-20 13:08:38.799: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
> .....2026-05-20 13:08:43.801: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
>
> It seems pgpool main process won't stop.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>

--
Nadav Shatz
Tailor Brands | CTO

Attachment: v4-0001-Feature-load-balancing-control-by-table-tracking.patch (application/octet-stream, 94.0 KB)
