| From: | Nadav Shatz <nadav(at)tailorbrands(dot)com> |
|---|---|
| To: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
| Cc: | pgpool-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Proposal: Recent mutated table tracking in memory |
| Date: | 2026-05-20 12:25:54 |
| Message-ID: | CACeKOO2eUrfo_UDMFSEd=2y8zj8y93m38EzRCpg1HuizYBf3wA@mail.gmail.com |
| Lists: | pgpool-hackers |
Hi Tatsuo,
Thanks for checking v3, and sorry for missing the test issue.
I reproduced the timeout locally, then found and fixed the root
cause.
Root cause
----------
In CommandComplete.c, the autocommit write-tracking code was
gated only on session_context->is_in_transaction, not on the
cluster mode.
In native replication and snapshot isolation modes,
dml_adaptive() is never called (it lives inside
where_to_send_main_replica), so is_in_transaction is never set
to true even inside an explicit BEGIN/COMMIT block. That meant
every DML in those modes was treated as autocommit by the
write-tracking code, which then called
pool_track_table_mutation_get_database_oid(). That function does
a relcache do_query, and here it ran while a transaction was
actually in flight on the backend connection. The do_query
conflicts with the in-flight transaction and hangs the session.
The subsequent shutdown then hangs in terminate_all_childrens /
waitpid.
Fix
---
Gate the autocommit write-tracking in CommandComplete.c on
MAIN_REPLICA in addition to the existing checks.
dml_adaptive_global is only meaningful in streaming replication
mode anyway (the matching routing logic in
where_to_send_main_replica is already SR-only), so this just
makes the autocommit path consistent.
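To illustrate the shape of the change, here is a minimal sketch
of the gating logic, not the code from the patch. The types and
function names (cluster_mode, session_state,
should_track_autocommit_write_v3/v4) are hypothetical stand-ins,
MODE_MAIN_REPLICA stands in for the MAIN_REPLICA check, and the
other existing checks are collapsed into a single
dml_adaptive_global flag as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real pgpool definitions differ. */
typedef enum {
    MODE_MAIN_REPLICA,        /* stand-in for the MAIN_REPLICA check being true */
    MODE_NATIVE_REPLICATION,
    MODE_SNAPSHOT_ISOLATION
} cluster_mode;

typedef struct {
    cluster_mode mode;
    bool dml_adaptive_global; /* assumed: all other existing checks, collapsed */
    bool is_in_transaction;   /* per the email, only set via dml_adaptive() */
} session_state;

/* v3 behaviour as described: the cluster mode is not consulted, so a
 * session whose is_in_transaction was never set looks like autocommit. */
static bool should_track_autocommit_write_v3(const session_state *s)
{
    assert(s != NULL);
    return s->dml_adaptive_global && !s->is_in_transaction;
}

/* v4 behaviour as described: additionally require the MAIN_REPLICA-style
 * mode, so the other modes never reach the tracking path. */
static bool should_track_autocommit_write_v4(const session_state *s)
{
    assert(s != NULL);
    return s->mode == MODE_MAIN_REPLICA
        && s->dml_adaptive_global
        && !s->is_in_transaction;
}
```

With mode set to MODE_NATIVE_REPLICATION and is_in_transaction
left false, the v3 predicate returns true (the hang case) while
the v4 predicate returns false.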
Also broadened the query cache bypass to all dml_adaptive*
modes. The new helper pool_has_dml_adaptive_write_in_transaction()
checks the existing memqcache DML oid buffer (oidbufp via the
new pool_has_dml_table_oids()), which is populated for any DML
in any cluster mode and reset on transaction boundary. This
fixes the original "SELECT returns stale 1 instead of 2 after
UPDATE" regression in streaming replication and avoids the same
class of bug in plain dml_adaptive too.
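The cache bypass can be pictured roughly as follows. This is a
simplified sketch under stated assumptions, not the patch: the
fixed-size dml_oid_buffer is a hypothetical stand-in for the
memqcache oid buffer, and every function name below is
illustrative rather than the real pgpool signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OID_BUF_CAPACITY 16

/* Hypothetical stand-in for the memqcache DML oid buffer. */
typedef struct {
    unsigned int oids[OID_BUF_CAPACITY];
    size_t count;
} dml_oid_buffer;

/* Record the oid of a table touched by DML in the current transaction.
 * If the buffer is full the oid is dropped; count stays non-zero, which
 * is all the "any write seen?" check below relies on. */
static void record_dml_table_oid(dml_oid_buffer *buf, unsigned int oid)
{
    assert(buf != NULL);
    if (buf->count < OID_BUF_CAPACITY)
        buf->oids[buf->count++] = oid;
}

/* Called on a transaction boundary (COMMIT / ROLLBACK). */
static void reset_on_transaction_boundary(dml_oid_buffer *buf)
{
    assert(buf != NULL);
    buf->count = 0;
}

/* Illustrative analogue of pool_has_dml_table_oids(). */
static bool has_dml_table_oids(const dml_oid_buffer *buf)
{
    assert(buf != NULL);
    return buf->count > 0;
}

/* Illustrative analogue of pool_has_dml_adaptive_write_in_transaction(). */
static bool has_dml_adaptive_write_in_transaction(const dml_oid_buffer *buf,
                                                  bool dml_adaptive_enabled)
{
    return dml_adaptive_enabled && has_dml_table_oids(buf);
}

/* A SELECT may be answered from the query cache only if no write has
 * been seen in the current transaction under a dml_adaptive* mode. */
static bool may_serve_select_from_cache(const dml_oid_buffer *buf,
                                        bool dml_adaptive_enabled)
{
    return !has_dml_adaptive_write_in_transaction(buf, dml_adaptive_enabled);
}
```

In this sketch, a SELECT issued after an UPDATE in the same
transaction skips the cache, and the cache becomes eligible
again once the buffer is reset on the transaction boundary.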
Verified
--------
- 006.memqcache with disable_load_balance_on_write =
'dml_adaptive_global' appended in all three modes: PASS
- 043.track_table_mutation: PASS
Attached: v4-0001-Feature-load-balancing-control-by-table-tracking.patch
Thanks!
On Wed, May 20, 2026 at 7:28 AM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> > Hi Nadav,
> >
> > Sorry, I missed your last email.
> > Will check & test tomorrow.
>
> I finally got a chance to test your v3 patch.
> Unfortunately the test failed with timeout again.
>
> testing 006.memqcache...timeout.
> out of 1 ok:0 failed:0 timeout:1
>
> From src/test/regression/log/006.memqcache:
>
> 2026-05-20 13:08:33.798: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
> .....2026-05-20 13:08:38.799: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
> .....2026-05-20 13:08:43.801: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
>
> It seems pgpool main process won't stop.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
--
Nadav Shatz
Tailor Brands | CTO
| Attachment | Content-Type | Size |
|---|---|---|
| v4-0001-Feature-load-balancing-control-by-table-tracking.patch | application/octet-stream | 94.0 KB |