| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | raphael(at)atmotrack(dot)fr, pgsql-bugs(at)lists(dot)postgresql(dot)org |
| Subject: | Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 |
| Date: | 2026-02-10 19:58:33 |
| Message-ID: | aYuLlzKTVLY9k1zB@alap3.anarazel.de |
| Lists: | pgsql-bugs |
Hi,
On 2026-02-10 15:28:38 +0000, PG Bug reporting form wrote:
> I recently migrated my cluster of 3 dedicated servers to a new cluster. I
> was running on PG12 and I am now on PG18.1.
> I noticed increasing memory usage on all 3 of my nodes, until at some point
> there is no memory left and Patroni crashes on the leader, leaving the
> cluster with no available primary.
> The cluster is a Data Warehouse type using TimescaleDB, ingesting approx. 1M
> time-series rows a day.
> It appears that the memory leak is affecting both the checkpointer and
> startup (WAL replay) processes in PostgreSQL 18.0 and 18.1.
> I never had such an issue on the old cluster with PG12, and the servers'
> configuration and cluster usage are the same (apart from the PG upgrade).
>
> SYMPTOMS:
> - Checkpointer process grows to 5.6GB RSS after 24 hours
> - Startup process on replicas grows to 3.9GB RSS
> - Memory growth rate: approximately 160-200MB per hour
> - Eventually causes out-of-memory conditions
>
> CONFIGURATION:
> - PostgreSQL version: Initially 18.0, upgraded to 18.1 - same issue persists
> - Platform: Debian 13
> - TimescaleDB: 2.23.0
> - Deployment: 3-node Patroni cluster with streaming replication
> - WAL level: logical
> - Hot standby enabled
>
> SYSTEM RESOURCES:
> RAM: 32GB
> Proc: 12 cores, Intel(R) Xeon(R) E-2386G 3.50GHz
>
> KEY SETTINGS:
> - wal_level: logical
> - hot_standby: on
> - max_wal_senders: 20
> - max_replication_slots: 20
> - wal_keep_size: 1GB
> - shared_buffers: 8GB
>
> WAL STATISTICS (over 7 days):
> - Total WAL generated: 2.3TB (approximately 31GB/day)
> - Replication lag: 0 bytes (replicas are caught up)
> - No long-running transactions
>
> MEMORY STATE AFTER 24 HOURS:
> On primary:
> postgres checkpointer: 3.9GB RSS
>
> On replicas:
> postgres checkpointer: 5.6GB RSS
> postgres startup recovering: 3.9GB RSS <-- This is abnormal
The RSS slowly increasing towards shared_buffers is normal if you're not using
huge_pages. The OS only counts a shared-memory page as part of a process's RSS
once that process has touched the page. Over time the checkpointer touches
more and more of shared_buffers, thus increasing its RSS.
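As a quick cross-check (a sketch, assuming Linux and a local psql), you can
confirm whether huge pages are actually in play:
# current setting; 'try' silently falls back to normal pages if allocation fails
psql -c 'SHOW huge_pages;'
# a non-zero HugePages_Total here means the kernel has huge pages configured
grep ^HugePages /proc/meminfo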
You can use "pmap -d -p $pid_of_process" to see how much of the RSS is
actually shared memory.
To show this, here's ps output for a new backend:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
andres 2544694 0.0 0.0 8719956 25744 ? Ss 14:55 0:00 postgres: dev assert: andres postgres [local] idle
and then after reading in a 1.3GB relation:
andres 2544694 1.7 2.2 8720972 1403576 ? Ss 14:55 0:00 postgres: dev assert: andres postgres [local] idle
So you can see that RSS increased proportionally with the amount of touched
data.
Whereas with pmap:
pmap -d -p 2544694|tail -n 1
mapped: 8721924K writeable/private: 5196K shared: 8646284K
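If you want a finer breakdown than pmap's summary line, a rough alternative
(assuming a kernel recent enough to provide /proc/$pid/smaps_rollup) is:
# Shared_* is mostly shared_buffers pages mapped by multiple processes;
# Private_Dirty is the process's own allocations - a real leak grows there
grep -E '^(Rss|Pss|Shared|Private)' /proc/$pid/smaps_rollup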
I think you would need to monitor the real memory usage of various processes
to know why you're OOMing.
You can use pg_log_backend_memory_contexts() to dump per-context memory usage
for backend processes to the server log.
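For example (a sketch; the backend_type values are as reported by
pg_stat_activity, and targeting auxiliary processes like the checkpointer
assumes a recent enough server version), as superuser:
psql -c "SELECT pg_log_backend_memory_contexts(pid)
         FROM pg_stat_activity
         WHERE backend_type IN ('checkpointer', 'startup');"
The per-context totals land in the server log; comparing two snapshots taken a
few hours apart should show whether any context is actually growing.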
Greetings,
Andres Freund