| From: | Nadav Shatz <nadav(at)tailorbrands(dot)com> |
|---|---|
| To: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
| Cc: | pgpool-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Proposal: recent access based routing for primary-replica setups |
| Date: | 2025-08-18 14:11:42 |
| Message-ID: | CACeKOO1-qront3LzcxOwjJBJz_jGYE9av4SrBa90SpTydPvY=Q@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgpool-hackers |
Hi Tatsuo,
Thank you very much for your reply and questions!
I'll try and respond to everything inline, please let me know if I missed
something or if anything isn't clear enough.
On Mon, Aug 18, 2025 at 3:51 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> Hello Nadav,
>
> Thank you for the proposal. I have a few questions.
>
> > Hello all,
> >
> > My name is Nadav Shatz, I’m the CTO at Tailor Brands and have been
> working
> > with PostgreSQL in high-traffic, distributed environments for many years.
> > Most of my focus has been on backend architecture, scaling, and
> performance
> > optimization, and I’m a long-time user and admirer of the Postgres
> > ecosystem.
> >
> > I’d like to propose adding a feature to pgpool-II for *recent access
> based
> > routing* in primary-replica setups. The idea is similar to what we’ve
> > described in this article
> > <
> https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
> >,
> > and is also reflected in this pgcat PR
> > <https://github.com/postgresml/pgcat/pull/864>. The core concept is to
> > route read queries to the primary if they occur shortly after a write,
> > reducing replica lag inconsistencies while still benefiting from read
> > scaling.
> >
> > *How it would work (high-level):*
> >
> >
> > -
> >
> >
> > *External “effective lag” via config (hot-reloaded): *Instead of relying
> on
> > pgpool-II’s replication delay checks (which don’t map well to Aurora
> > semantics), we’ll expose a *config value* representing the effective
> > replica lag (or directly the TTL to use for “recency”). This value
> > is *pushed
> > by an external controller* and *hot-reloaded* (no restarts). The
> > relevant knobs might look like:
> >
> > -
> >
> > enable_recent_access_routing (boolean, default off)
> > -
> >
> > recent_access_ttl_ms (integer, default 0, can be hot-reloaded)
>
> If my understanding is correct, the "external controller" updates
> "recent_access_ttl_ms" to let pgpool know the current delay of
> replica. My question is, what if there are multiple replicas. In this
> case the "external controller" calculates the average latency of each
> replica?
>
> Another question is, how often the external controller updates and
> reload pgpool.conf. If it's like every second, probably it could give
> unacceptable load to pgpool because reloading pgpool.conf is expensive
> operation.
>
>
You understood correctly - my plan was to keep it as generic as possible
and leave all logic to be handled by the external controller. Basically
leaving all of these decisions (how often to update, calculation, etc.) to
the external implementation as it can get very case specific.
This approach comes from the need of replica lag understanding under AWS
Aurora - which doesn't expose these metrics from the DB itself.
I also thought of implementing a couple of other possible mechanisms:
1. use a pcp command like you suggest below, i wasn't aware of the option,
this will handle the expensive operation but no other concerns mentioned.
2. we can implement support to using the AWS Aurora API directly for the
lag, while this is cloud provider and db "flavor" specific, it is a very
large and common use case. Doing this will open up all other pgpool
features that rely on the lag values being available. From a performance
perspective it is probably best.
> enable_query_parser (boolean, required for this feature, default
> off)
>
> What does this do? Why do you need this?
>
this was referring to enabling the auto routing already existing in pgpool
(based on query content), the naming is wrong.
basically meant to say - if the auto routing is disabled, there is no point
in enabling the latest access based routing.
Sorry for the confusion.
>
> > *In-memory recent-access map: *Each worker maintains a lightweight per-DB
> > in-memory map of *recently written relations*. On any write
>
> Is "per-DB in-memory map" in shared memory?
>
Yes
>
> > (INSERT/UPDATE/DELETE/UPSERT/TRUNCATE), we record the touched
> relations
> > with a TTL derived from recent_access_ttl_ms. Entries expire
> > automatically; writes refresh them.
>
> How do you automatically expire the entries? Are you going to
> implement something like a auto sweeper process?
>
Great question - maybe combine that with a lazy deletion process on read.
similar to what memcached is doing.
>
> > *Routing + query parsing: *For incoming statements we parse enough to
> > answer two questions: (1) is it a read or a write? and (2) which
> relations
> > are referenced? If a read touches any “recently written” relation, we
> *force
> > route to primary*; otherwise we allow normal read load-balancing to
> > replicas.
>
> Pgpool-II already does (1) and (2).
>
1 - of course, i'm trying to build on top of it.
2 - maybe i'm not understanding the existing documentation correctly - but
i couldn't find something that takes the specific relations (tables) under
consideration, only query type (Read/Write) or passing the delay_threshold.
Our approach here basically accepts no delay for these specific relations.
so you get guaranteed data freshness at the expense of checking the
specific table. it's a different kind of tradeoff.
the whole approach can be expanded to take further "generic values" under
considerations if needed to also take "tenant" id for instance under
consideration. tho for those cases, using a table per tenant already solves
that.
Please let me know what i might be missing here.
>
> > *Notes on behavior & ops:*
> >
> >
> > -
> >
> > *Config & hot reload:* Operators (or an external controller) can
> update
> > recent_access_ttl_ms dynamically and trigger hot reload to adapt to
> > changing conditions―no reliance on Aurora internals.
> > -
> >
> > *Safety levers:* a global max TTL, optional allow/deny lists, and
> > metrics (e.g., “reads forced to primary due to recency”) for
> visibility.
>
> Please elaborate more on this. Allow/deny what?
>
We can add "table list" that would ignore the feature, or in reverse as an
allow list that would enable it only for specific tables. I don't think
that's needed, especially not for V1.
> > *Defaults & compatibility:* all defaults are safe/off; enabling
> requires
> > explicit opt-in.
>
> Sounds good.
>
> > I’ll prepare the code changes and send a patch/PR, but before diving in I
> > wanted to check if anyone has *objections, concerns, or preferred
> > alternatives*―particularly around parser hooks, shared memory use, or
> > hot-reload mechanics in pgpool-II.
>
> Probably you should consider adding a pcp command to notice pgpool the
> "recent_access_ttl_ms". That is far more efficient than reloading
> pgpool.conf.
Great idea! i wasn't aware of the mechanism to be honest.
lastly another note that came up - we can disable the feature and load
balancing in case that we have to evict old items in the map. or have it
configurable how to behave in such a scenario.
>
> > Thanks for considering,
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
Best regards,
--
Nadav Shatz
Tailor Brands | CTO
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tatsuo Ishii | 2025-08-20 06:16:46 | Enhance watchdog_setup |
| Previous Message | Tatsuo Ishii | 2025-08-18 12:51:06 | Re: Proposal: recent access based routing for primary-replica setups |