| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, itsajin(at)gmail(dot)com, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
| Subject: | Re: Bound memory usage during manual slot sync retries |
| Date: | 2026-05-16 16:25:20 |
| Message-ID: | CABPTF7W6RgOSbZ9cjLv674zee4xum9CVyGe13uYg_1BdQgLG7w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Sat, May 16, 2026 at 7:35 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2026 at 9:20 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> >
> > Hi Amit,
> >
> > On Fri, May 15, 2026 at 5:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Fri, May 15, 2026 at 11:02 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > > >
> > > > pg_sync_replication_slots() now retries inside a single SQL function
> > > > call. Unlike the slotsync worker, it does not get a transaction boundary
> > > > between retry cycles, so allocations made while fetching and synchronizing
> > > > remote slots can accumulate until the function returns.
> > > >
> > > > The existing list_free_deep(remote_slots) is not enough to bound this.
> > > > It frees the List cells and the RemoteSlot structs stored as list
> > > > elements, but it does not recursively free allocations owned by those
> > > > structs, such as the copied slot name, plugin name, and database name.
> > > >
> > >
> > > Right.
> > >
> > > > It also does not release unrelated per-cycle allocations made while
> > > > fetching and processing the remote slots.
> > > >
> > >
> > > BTW, did you notice via test or otherwise, what and how much other
> > > unrelated memory is being allocated during each sync cycle if we
> > > manually free the allocations you observed? This is mainly to learn
> > > the impact of not doing all these allocations in some short-duration
> > > memory context.
> >
> > I noticed this by reading the feature code while walking through the
> > PG 19 release notes, not by observing actual memory growth in a
> > running system. Besides the RemoteSlot field strings, there seems to
> > be a few smaller per-cycle allocations that also accumulate:
> > quote_literal_cstr() strings used to build the filtered query, a
> > temporary TextDatumGetCString() result for invalidation_reason, the
> > standalone TupleTableSlot in fetch_remote_slots(), and the list
> > container built by get_local_synced_slots(). I chose a per-cycle
> > memory context over manual pfrees to make the retry-cycle lifetime
> > explicit and avoid maintaining a destructor for every current and
> > future allocation in this path. It may be worth measuring how much
> > extra memory actually accumulates during an extended wait to confirm
> > the practical impact.
>
> I did some measurements for the memory growth in the manual
> pg_sync_replication_slots() retry path.
>
> The test used 100 failover logical slots and forced
> pg_sync_replication_slots() to keep retrying on the standby. Memory
> was sampled with pg_log_backend_memory_contexts() on the backend
> running the function.
>
> On HEAD, the short run showed:
> timestamp total_bytes delta
> 2026-05-16T10:47:54Z,63422,1339920,
> 2026-05-16T10:47:56Z,63422,1405456,65536
> 2026-05-16T10:47:59Z,63422,1405456,0
> 2026-05-16T10:48:01Z,63422,1405456,0
> 2026-05-16T10:48:03Z,63422,1536528,131072
> 2026-05-16T10:48:05Z,63422,1536528,0
>
> So the total increase was about 192 KiB.
>
> After adding a targeted cleanup for the copied RemoteSlot strings, the
> same test showed:
>
> timestamp total_bytes delta
> 2026-05-16T11:04:58Z,77000,1339920,
> 2026-05-16T11:05:00Z,77000,1339920,0
> 2026-05-16T11:05:02Z,77000,1405456,65536
> 2026-05-16T11:05:04Z,77000,1405456,0
>
> So the increase dropped to about 64 KiB.
>
> With a per-retry memory context around the fetch/synchronize cycle,
> the same test stayed flat:
>
> 2026-05-16T11:09:10Z,84600,1315344,
> 2026-05-16T11:09:12Z,84600,1315344,0
> 2026-05-16T11:09:14Z,84600,1315344,0
> 2026-05-16T11:09:16Z,84600,1315344,0
> 2026-05-16T11:09:18Z,84600,1315344,0
> 2026-05-16T11:09:20Z,84600,1315344,0
>
> Looking at the memory-context dumps, the growth on HEAD was visible
> under ExprContext. The grand-total increase matched the ExprContex
> increase, which is consistent with retry-cycle allocations surviving
> for the duration of the single SQL function call.
>
> That said, the practical severity looks small. This is mainly because
> the retry loop is not a tight loop. With no slot activity,
> wait_for_slot_activity() doubles the sleep time up to 30 seconds.
>
> So after about 51 seconds it retries only about twice per minute. For
> 100 slots and no slot activity, a rough 1-hour test from the short
> runs is on the order of a few MB on HEAD, and around 1 MB with the
> manual RemoteSlot cleanup. For installations with fewer slots, it
> should be smaller.
>
> So my read is:
> the accumulation is real;
> it is modest because of the retry backoff;
> manually freeing RemoteSlot’s copied strings removes most of the
> observed growth;
> a per-retry memory context fully bounds the retry-cycle lifetime
Expectedly, the memory accumulation is much more pronounced with a
churning slot to disablet the backoff. I'll share the findings later.
--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Dean Rasheed | 2026-05-16 16:31:52 | Re: [PATCH] Fix overflow and underflow in regr_r2() |
| Previous Message | Andres Freund | 2026-05-16 15:53:49 | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |