From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: PostgreSQL General <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
Date: 2025-08-19 17:21:47
Message-ID: aKSyq74zIuLS-2B5@depesz.com
Lists: pgsql-general
Hi,
we have the following situation: Pg 14.17 (yes, I know, but it can't be
upgraded now/soon), on Ubuntu focal, in the AWS cloud, on EC2 servers
using the arm64 architecture.
Everything works fine.
Every now and then (usually every 3-5 minutes, but not throughout the
whole day), we see situations where every query suddenly takes ~ 1
second. I'm talking about "prepare" for prepared statements, binds, and
also "discard all".
We have logging enabled to csvlog, which offers millisecond precision.
For ~ 1 second nothing goes to the log (we usually have 5-20 messages
logged per second): no connections, nothing. Then we get a bunch (30+)
of messages with the same millisecond timestamp, and they all have
insane durations - 800-1300 ms for virtually anything.
After such a second, everything works fine, without any problems, up to
the next occurrence.
Unfortunately, due to the short duration of these stalls, and the fact
that literally *everything* is paused for this 1 second, it's hard to
debug/diagnose.
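Since everything stalls at once, one crude approach (just a sketch, not something we've run; the probe callable, thresholds, and intervals are my assumptions) is an external watchdog that times a trivial query every ~100 ms and records exactly when a stall begins, so the moment can be correlated with system-level tools (perf, sar, /proc/pressure):

```python
import time

def watch_for_stalls(probe, threshold_s=0.5, iterations=100, sleep_s=0.1):
    """Run a cheap probe repeatedly; collect (start, duration) for every
    call that took at least threshold_s seconds."""
    stalls = []
    for _ in range(iterations):
        start = time.monotonic()
        probe()  # e.g. a callable that runs "SELECT 1" against the server
        elapsed = time.monotonic() - start
        if elapsed >= threshold_s:
            stalls.append((start, elapsed))
        if sleep_s:
            time.sleep(sleep_s)
    return stalls
```

In practice the probe would be a persistent database connection issuing "SELECT 1", and on a recorded stall you could immediately snapshot pg_stat_activity or kick off perf from a wrapper script.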
The servers have memory that is almost 2x the total DB size (200 GB of
data vs. 384 GB of RAM), so disk shouldn't be an issue.
Aside from this, we don't see any other problems.
Any idea what to look at, and what to look for, to be able to diagnose
the issue?
We do use some savepoints, but it's hard to tell when exactly they
happen (we usually log only queries that take more than 250 ms, and
queries that use savepoints are usually much faster).
The DB server in question has ~ 150 connections and, at the times this
happens, handles ~ 40-50 ktps.
Logging is set using:
name │ setting
═══════════════════════════════════╪════════════════════════════════
log_autovacuum_min_duration │ 0
log_checkpoints │ on
log_connections │ on
log_destination │ csvlog
log_directory │ /cache/postgres_logs
log_disconnections │ off
log_duration │ off
log_error_verbosity │ default
log_executor_stats │ off
log_file_mode │ 0600
log_filename │ postgresql-%Y-%m-%d_%H%M%S.log
log_hostname │ off
log_line_prefix │ db=%d,user=%u
log_lock_waits │ on
log_min_duration_sample │ 0
log_min_duration_statement │ 0
log_min_error_statement │ error
log_min_messages │ warning
log_parameter_max_length │ -1
log_parameter_max_length_on_error │ 0
log_parser_stats │ off
log_planner_stats │ off
log_recovery_conflict_waits │ off
log_replication_commands │ off
log_rotation_age │ 60
log_rotation_size │ 51200
log_statement │ none
log_statement_sample_rate │ 0.0001
log_statement_stats │ off
log_temp_files │ 0
log_timezone │ UTC
log_transaction_sample_rate │ 0
log_truncate_on_rotation │ off
logging_collector │ on
and it doesn't seem to show anything interesting. Checkpoints happen,
but they don't seem correlated in any way.
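One way to pin down the stalls from the csvlog itself is to scan the timestamps for silent gaps. A rough sketch (it assumes the first csv column is log_time in the default "YYYY-MM-DD HH:MM:SS.mmm TZ" format; the function name and the 0.5 s cutoff are my own):

```python
import csv
from datetime import datetime

def find_log_gaps(lines, min_gap_s=0.5):
    """Scan csvlog rows (first column = log_time) and return
    (gap_start, gap_seconds) for every silent period >= min_gap_s."""
    gaps = []
    prev = None
    for row in csv.reader(lines):
        # e.g. "2025-08-19 17:21:47.123 UTC" -> keep the first 23 chars
        ts = datetime.strptime(row[0][:23], "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta >= min_gap_s:
                gaps.append((prev, delta))
        prev = ts
    return gaps
```

Run over a day's worth of csvlog, the gap start times could then be compared against checkpoint completion times, autovacuum runs, or cron/EC2 events.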
Any ideas?
Best regards,
depesz