Re: Add max_wal_replay_size connection parameter to libpq

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
To: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add max_wal_replay_size connection parameter to libpq
Date: 2026-03-29 18:31:49
Message-ID: CAHg+QDeO4cbOKJSx0+yzD00a_1xv_z9dLrE4nKkxb1A+KEoKeg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Jim,

On Sun, Mar 29, 2026 at 10:56 AM Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
wrote:

> Hi,
>
> When connecting with target_session_attrs=standby (or prefer-standby,
> read-only, any) and multiple standbys are available, libpq currently
> selects the first acceptable candidate without regard for how "current"
> its data is. A standby configured with recovery_min_apply_delay,
> experiencing slow I/O, or otherwise lagging is treated the same as one
> that is fully caught up.
>
> I would like to propose a new libpq connection parameter,
> max_wal_replay_size, that allows clients to skip standby servers whose
> WAL replay backlog exceeds a given threshold.
>
> Example:
>
> psql "host=host1,host2,host3 port=5111,5222,5333 \
> target_session_attrs=standby max_wal_replay_size=16MB"
>
> When this parameter is set, libpq executes a small query during
> connection establishment to evaluate:
>
> pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
>
> on the standby. If the result exceeds the specified threshold, the
> server is skipped and the next host in the list is tried. The check is
> skipped entirely when target_session_attrs is set to primary or
> read-write, since those modes already exclude standbys.
>

What if none of them meets the criteria? You fail the connection? Wouldn't
it cause an availability issue?

>
> If pg_last_wal_receive_lsn() is NULL (e.g. no active WAL receiver due to
> missing primary_conninfo or a disconnected upstream), the backlog cannot
> be determined. In that case, the standby is treated as exceeding the
> threshold and is skipped.
>

When a standby is replaying archiving log, it can still be caught up. This
doesn't seem right to me.

>
> This parameter measures only the apply lag on the standby itself, i.e.,
> how much already-received WAL remains to be replayed. It does not
> attempt to measure how far the standby is behind the primary. In
> particular, a standby that is slow to receive WAL but fast to replay it
> may report a small backlog here while still being significantly behind.
>

IMHO, this change appears to not meet the objective of routing
connections/queries to the most up-to-date standby.

Thanks,
Satya

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2026-03-29 18:33:44 Re: docs: warn about post-data-only schema dumps with parallel restore.
Previous Message KAZAR Ayoub 2026-03-29 18:23:44 Re: Add pg_stat_vfdcache view for VFD cache statistics