Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18

From: Andres Freund <andres(at)anarazel(dot)de>
To: raphael(at)atmotrack(dot)fr, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18
Date: 2026-02-10 19:58:33
Message-ID: aYuLlzKTVLY9k1zB@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2026-02-10 15:28:38 +0000, PG Bug reporting form wrote:
> I recently migrated my cluster with 3 dedicated servers to a new cluster. I
> was running on PG12 and I am now on PG18.1.
> I noticed increasing memory usage on all 3 of my nodes, until at some
> point there is no memory left and Patroni crashes on the leader, leaving the
> cluster with no available primary.
> The cluster is a Data Warehouse type using TimescaleDB, ingesting approx. 1M
> time-series data points a day.
> It appears that the memory leak is affecting both the checkpointer and
> startup (WAL replay) processes in PostgreSQL 18.0 and 18.1.
> I never had such an issue on the old cluster with PG12, and the server
> configuration and cluster usage are the same (apart from the PG upgrade).
>
> SYMPTOMS:
> - Checkpointer process grows to 5.6GB RSS after 24 hours
> - Startup process on replicas grows to 3.9GB RSS
> - Memory growth rate: approximately 160-200MB per hour
> - Eventually causes out-of-memory conditions
>
> CONFIGURATION:
> - PostgreSQL version: Initially 18.0, upgraded to 18.1 - same issue persists
> - Platform: Debian 13
> - TimescaleDB: 2.23.0
> - Deployment: 3-node Patroni cluster with streaming replication
> - WAL level: logical
> - Hot standby enabled
>
> SYSTEM RESOURCES:
> RAM: 32GB
> Proc: 12 core of Intel(R) Xeon(R) E-2386G 3.50GHz
>
> KEY SETTINGS:
> - wal_level: logical
> - hot_standby: on
> - max_wal_senders: 20
> - max_replication_slots: 20
> - wal_keep_size: 1GB
> - shared_buffers: 8GB
>
> WAL STATISTICS (over 7 days):
> - Total WAL generated: 2.3TB (approximately 31GB/day)
> - Replication lag: 0 bytes (replicas are caught up)
> - No long-running transactions
>
> MEMORY STATE AFTER 24 HOURS:
> On primary:
> postgres checkpointer: 3.9GB RSS
>
> On replicas:
> postgres checkpointer: 5.6GB RSS
> postgres startup recovering: 3.9GB RSS <-- This is abnormal

The RSS slowly increasing towards shared_buffers is normal if you're not using
huge_pages. The OS only counts a page of shared memory as part of a process's
RSS once that process has touched the page. Over time the checkpointer touches
more and more of shared_buffers, thus increasing its RSS.

You can use "pmap -d -p $pid_of_process" to see how much of the RSS is
actually shared memory.

To show this, here's ps output for a new backend:

ps:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
andres 2544694 0.0 0.0 8719956 25744 ? Ss 14:55 0:00 postgres: dev assert: andres postgres [local] idle

and then after reading in a 1.3GB relation:

andres 2544694 1.7 2.2 8720972 1403576 ? Ss 14:55 0:00 postgres: dev assert: andres postgres [local] idle

So you can see that the RSS increased in proportion to the amount of data
touched.

Whereas with pmap:

pmap -d -p 2544694 | tail -n 1
mapped: 8721924K writeable/private: 5196K shared: 8646284K
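
The same shared-vs-private split can also be read straight from /proc. A
minimal sketch, assuming a Linux kernel new enough to provide
/proc/<pid>/smaps_rollup (4.14+); it inspects the current shell as a
stand-in, so substitute the pid of the checkpointer or startup process:

```shell
# Split a process's RSS into shared vs private pages via /proc.
# Assumption: Linux with /proc/<pid>/smaps_rollup (kernel 4.14+).
pid=$$   # stand-in: the current shell; use a postgres pid instead
rss=$(awk '/^Rss:/ {print $2}' "/proc/$pid/smaps_rollup")
shared=$(awk '/^Shared_(Clean|Dirty):/ {s += $2} END {print s+0}' "/proc/$pid/smaps_rollup")
private=$(awk '/^Private_(Clean|Dirty):/ {s += $2} END {print s+0}' "/proc/$pid/smaps_rollup")
echo "RSS=${rss} kB shared=${shared} kB private=${private} kB"
```

If the shared figure accounts for most of the RSS, the growth is just
shared_buffers being touched, not a leak in the process itself.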

I think you would need to monitor the real memory usage of various processes
to know why you're OOMing.

You can use pg_log_backend_memory_contexts() to get the memory usage
information of backend processes.
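
For instance, something like the following would ask the checkpointer to dump
its memory context breakdown into the server log (a sketch, not a recipe: it
needs a superuser connection against your running cluster, the connection
options are placeholders, and targeting auxiliary processes such as the
checkpointer this way requires PG17 or later; on a standby the startup process
can be targeted the same way via its pid in pg_stat_activity):

```shell
# Sketch: log the checkpointer's memory contexts (output goes to the
# server log, not to psql). Connection parameters are placeholders.
psql -X -c "SELECT pg_log_backend_memory_contexts(pid)
              FROM pg_stat_activity
             WHERE backend_type = 'checkpointer';"
```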

Greetings,

Andres Freund
