From: | hubert depesz lubaczewski <depesz(at)depesz(dot)com> |
---|---|
To: | Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> |
Cc: | PostgreSQL General <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug |
Date: | 2025-08-21 15:13:42 |
Message-ID: | aKc3pidKUSzq-1-8@depesz.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Thu, Aug 21, 2025 at 08:04:25AM -0700, Adrian Klaver wrote:
> > > > > "For ~ 1 second there are no logs going to log (we usually have at 5-20
> > > > > messages logged per second), no connection, nothing. And then we get
> > > > > bunch (30+) messages with the same milisecond time."
> > > > > Are the 30+ messages all coming in on one connection or multiple
> > > > > connections?
> > > > Multiple connections.
> > > > > Also to be clear these are statements that are being run on the replica
> > > > > locally, correct?
> > > > What do you mean locally?
> > > I should have been clearer. Are the queries being run against the replica or
> > > the primary?
> > All to replica. Primary has its own work, of course, but the problem
> > we're experiencing is on replicas.
>
> If I am following there is more then one primary --> replica pair and the
> problem exists across all the pairs.
Not all. We have ~ 300 such clusters. The thing doesn't cause any
customer-visible issues (after all it's just 1 second delay every so
often), so it's generally overlooked when it happens.
But we were paying closer attention to one such cluster, and then couple
of other, and we've seen this behavior.
> > > How many applications servers are hitting the database?
> >
> > To be honest, I'm not sure. I have visibility into dbs, and bouncers,
> > not really into Apps. I know that these are automatically dynamically
> > scaled, so number of app server is very varying.
> >
> > I'd say anything from 40 to 200 app servers hit first layer of bouncers,
> > which we usually have 6-9 (2-3 per az).
> >
> > These go to 2nd layer of bouncers, on the db server itself.
>
> By bouncer I assume you mean something like pgBouncer, a connection pooler.
> Is it possible to determine what bouncer the queries in question are coming
> from?
From the POV of db, all queries are coming from one of N localhost
bouncers. N is usually 2…6.
From the POV of the local bouncers, the queries come from range of
remote bouncers.
Generally we haven't seen any correlation between queries coming from
specific ranges of ips. Logged queries, the ones that we see with
runtime of 1s, have comments that indicate source, and they some from
"all-around". Specifically, "DISCARD ALL" queries are generated by
bouncers themselves (both layers).
Just so that it will be clear, I don't expect anyone to be able to
diagnose the problem based on description. I'm looking more into idea
what to look for. The issue is that with the situation being pretty
short, and happening on servers with non-trivial query load, I can't do
stuff, like, for example, strace, stuff.
Best regards,
depesz
From | Date | Subject | |
---|---|---|---|
Next Message | Karsten Hilbert | 2025-08-21 15:36:07 | Q: GRANT ... WITH ADMIN on PG 17 |
Previous Message | Adrian Klaver | 2025-08-21 15:04:25 | Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug |